Difference between revisions of "ISFDB:Author Names Cleanup"
Latest revision as of 17:56, 7 August 2009
Project Description
The Author Names Cleanup project aims to find and eliminate mistyped, duplicate, poorly formatted or otherwise erroneous Author records.
Sub-projects
Questionable Suffixes
"Authors.pl" is a Perl script which searches a flat file of all ISFDB Author names extracted from the MySQL database for unusual suffixes. "Suffixes" are defined as any characters following a comma. "Usual" suffixes are defined as:
- Sr.
- Jr.
- II
- III
- IV
- Ph.D.
- M.D.
Script
use strict;
use warnings;

my $mainfile = "c:/ISFDB/Authors.txt";
open(my $authors, '<', $mainfile) || die("can't open file $mainfile: $!");
while (my $string = <$authors>) {
    # Put the suffix (anything after ",") into $suffix[1]
    my @suffix = split /,\s*/, $string;
    # If there is more than 1 comma, then there is an error
    if ($#suffix > 1) {
        print $string;
        next;
    }
    # If there is a suffix, check if it's in the list of approved suffixes.
    # Every alternative is anchored and every literal dot is escaped.
    if ($#suffix == 1 && $suffix[1] !~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/) {
        print $string;
    }
}
close($authors);
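For anyone who would rather experiment without Perl, the same rule can be sketched in Python. This is only an illustration of the check described above, not the script that produced the results listed on this page; the function name `questionable_suffix` is an assumption made for the example, and the sample names are taken from this page.

```python
import re

# Approved suffixes, taken from the "usual" list above.
APPROVED = re.compile(r"^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$")

def questionable_suffix(name):
    """Return True if the name has more than one comma or an unapproved suffix."""
    parts = re.split(r",\s*", name.strip())
    if len(parts) > 2:
        return True   # more than one comma is always reported
    if len(parts) == 2:
        return APPROVED.match(parts[1]) is None
    return False      # no comma, so there is no suffix to check

for sample in ["Neal Barrett, Jr.", "Neal Barrett, Jr", "John Pierce, M.S.",
               "Arlan Andrews, Kris Andrews, Joe Giarratano", "Yukio Mishima"]:
    print(sample, "->", "questionable" if questionable_suffix(sample) else "ok")
```

Unlike Perl's `split`, `re.split` keeps a trailing empty field, so a name that ends in a bare comma is reported here as well.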
Identified questionable suffixes
Results as of 09/10/06 when run against the 08/27/06 ISFDB backup file using ActivePerl/Windows XP (solved cases moved below):
John Pierce, M.S.
Epaminondas T. Snooks, D.T.G.
Ronald V. Dorn, Jr. M.D.
Arthur W. Weir, D.Sc.
Zuprik-Curtis Enterprises, Inc.
Universal City Studios, Inc.
Ben R., Ph.D. Games
Neal Barrett, Jr
Yvonne, Fern Solow
Peter, Dr. Beckmann
Charles, Waugh
Arlan Andrews, Kris Andrews, Joe Giarratano
Jenifer, A. Ruth
O. David, Dr West
Riley W., Jr Sanson
Hiccup Horrendous, III Haddock
Louis, Jr. Porter
Joseph S., Jr. Nye
Rob, Jr Potchak
Jr, Bill Martin
Normand, R. Bernier
Wilson, Tortosa
Robert S., Jr. Sanders
Douglas M., Sir Price
Joseph, Jr Covino
Lovelee, I. Dagum
John, A. Hall
James, Sir Knowles
Mark, Edward Hall
MJ Studios, cover art Jim Seward
Seton Hall University, Dr. Dermot Quinn
David, Niall Wilson
Jimmie E., Jr. Cain
Todd, F. Davis
Hugh J., Jr. Luke
Cleaned up Suffixes
- Neal, Jr. Barrett - just merged into Neal_Barrett,_Jr. The records Neal_Barrett, Jr and Neal_Barrett also seem to remain, although they are empty; should they be merged too?
- Mishima, Yukio - just merged into Yukio_Mishima
- Gordon, R. Dickinson - just removed from a stray pub with Gordon R. Dickson
- Chesterton scholar, Aidan Mackey - merged into Aidan_Mackey several weeks ago
- Richard Gilliam, Wendy Webb, Edward E. Kramer, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Thomas R. Hanlon; Richard Gilliam, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Kathleen M. Massie-Ferch; Janet Berliner, Uwe Luserke, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Edward E. Kramer - these seem to have been split a long time ago, as they are no longer present
- Brenda, W. Clough; Michael, Moorcock - long gone
Is there a use for documenting this so meticulously, or would it be enough just to delete from the list above? --JVjr 06:50, 10 Mar 2007 (CST)
- "PhD, C. Malcolm Trowbridge" changed to "C. Malcolm Trowbridge, Ph.D." and "Ph.D, Sandra Eubanks" merged with "Sandra Eubanks, Ph.D." BLongley 11:20, 31 Mar 2007 (CDT)
- I checked a few more. These are no longer on the list, and I suspect they were resolved as follows:
Michael, W. Perry -> Michael W. Perry
Arlan Keith Andrews, Sr -> pseudonym
John C. Wright, Esq. -> pseudonym
Roscoe Clark, F.R.C.S. -> pseudonym
Evelyn A. Archer, P.I. -> pseudonym
The 1992 James Tiptree, Jr Award Judges -> pseudonym
The 1995 James Tiptree, Jr Award Judges -> pseudonym
J. A, Lawrence -> J. A. Lawrence
Rockwell, Carey -> Carey Rockwell
Mary, H Herbert -> Mary H. Herbert
James, White -> James White
Zora, N. Hurston -> Zora Neale Hurston
Francis M., Jr. Nevins -> Francis M. Nevins, Jr.
Dora and MacGregor,Eleanor Pantell -> Dora Pantell, Ellen MacGregor
Stuart, Gordon -> Stuart Gordon
Stanley Grauman, Weinbaum -> Stanley Grauman Weinbaum, pseudonym
Walter, Jr. Wangerin -> Walter Wangerin, Jr.
Roberta Carter, Rogers, Jacqueline Clark -> Roberta Carter Clark, Jacqueline Rogers
William, F. Nolan -> William F. Nolan
W., Rev Awdry -> Reverend W. Awdry
Emmett O., III Saunders -> Emmett O. Saunders, III
Esme Nichola Author Winter, Barbara Illustrator Shilletto -> Esme Nichola Shilletto, Barbara Winter
Mike, Jr. Deodato -> Gone?
Kenneth, Jr. Faig -> Kenneth W. Faig Jr.
Philip Harbottle, Editor -> Philip Harbottle
Michael Simon Bodner, PhD -> Michael Simon Bodner, Ph.D.
Richard, J. O'brien -> Richard J. O'Brien
D., M. Brown -> D. M. Brown
- BLongley 16:52, 12 Apr 2007 (CDT)
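Several of the fixes above follow one pattern: a suffix stranded between the given names and the surname ("Walter, Jr. Wangerin"). A short Python sketch can propose the repositioned form for an editor to review. The function name `propose_fix` and the choice of suffixes it recognises are assumptions made for this illustration; nothing here edits the database, and some records (for example Kenneth, Jr. Faig above) were resolved differently by hand.

```python
import re

# "Given, Suffix Surname": the suffix sits in the middle of the name.
STRANDED = re.compile(
    r"^(?P<given>[^,]+), (?P<suffix>Jr\.?|Sr\.?|II|III|IV) (?P<surname>\S+)$"
)

def propose_fix(name):
    """Return 'Given Surname, Suffix' for a stranded suffix, or None if the pattern does not apply."""
    m = STRANDED.match(name)
    if m is None:
        return None
    suffix = m.group("suffix")
    if suffix in ("Jr", "Sr"):
        suffix += "."   # normalise to the approved spelling with a full stop
    return "%s %s, %s" % (m.group("given"), m.group("surname"), suffix)

for sample in ["Francis M., Jr. Nevins", "Emmett O., III Saunders",
               "Riley W., Jr Sanson", "William, F. Nolan"]:
    print(sample, "->", propose_fix(sample))
```

Names such as "William, F. Nolan", where the comma is simply a typo, return `None` and still need a manual look.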
Suspected Duplicate Author Names
Need to develop a specification and write a script, possibly multiple scripts depending on what kinds of requirements we will come up with. Also, we will need a way of finding out whether, e.g., Nancy Farmer the popular children's author is the same person as Nancy Farmer the co-author of Update One - Federal Fisheries Management: A Guidebook to the Magnuson Fishery Conservation and Management Act (1987). Please discuss on the Talk page.
Is this the right place to list authors I think are probably duplicates? I suspect that Robert Boyer is the same as Robert H. Boyer, likewise the two Zahorskies. WimLewis 18:01, 15 Mar 2007 (CDT)
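As a starting point for the specification, one possible heuristic is to group names that share a first and last word, which would pair Robert Boyer with Robert H. Boyer. The sketch below is an assumption about how such a script might look, not an agreed design; `duplicate_key` and `candidate_duplicates` are hypothetical names. It can only suggest candidates: it cannot tell whether two records with the same name, such as the two Nancy Farmers, are the same person.

```python
from collections import defaultdict

def duplicate_key(name):
    """Reduce a name to (first word, last word), lower-cased, ignoring anything after a comma."""
    words = name.split(",")[0].split()
    if not words:
        return None
    return (words[0].lower(), words[-1].lower())

def candidate_duplicates(names):
    """Return sorted groups of distinct names that share the same key."""
    groups = defaultdict(set)
    for name in names:
        key = duplicate_key(name)
        if key is not None:
            groups[key].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]

sample = ["Robert Boyer", "Robert H. Boyer", "Nancy Farmer", "Tad Williams"]
print(candidate_duplicates(sample))
```

Grouping on first and last word is deliberately loose, so the output needs to be checked by eye before anything is merged.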
Anonymous, uncredited, etc.
Originally compiled by Marc Kupper 14:10, 16 Nov 2006 (CST):
Author | Long | Short | Awards | Comments |
---|---|---|---|---|
Anonymous | ~150 | ~250 | 6 | |
Anonyous | 0 | 3 | 0 | Should get merged with Anonymous [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)] |
Not Available | 4 | 0 | 0 | This sounds like our "not stated" [Fixed - I checked each of the four titles and for each was able to dig up the author name. Marc Kupper 03:01, 22 Nov 2006 (CST)] |
uncredit | 0 | 1 | 0 | Should get merged with uncredited. [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)] |
uncredited | 0 | ~1,000 | 0 | |
Unknown | ~1,050 | ~550 | 172 | Used for many cover/interior art instead of "unsigned" |
unknown | ~1,050 | ~550 | 172 | Both "unknown" and "Unknown" get used but search returned same list |
unknownAfghan | 0 | 1 | 0 | No explanation in the story notes about this |
unsigned | 45 | 0 | 0 | All 45 long works are for Interior Art |
Unknown Unknown | | | | Listed on http://www.isfdb.org/DIR_U.html but not found |
Various | 69 | 0 | 0 | |
Various Authors | 2 | 0 | 0 | Should get merged into Various [both were bad entries and have been deleted. Ahasuerus] |
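Misspelled placeholders such as "Anonyous" and "uncredit" could be found automatically by comparing each name against a short list of accepted placeholders. The following Python sketch uses the standard `difflib` module; the `CANONICAL` list, the 0.8 cutoff and the function name `suggest_canonical` are assumptions made for illustration, and which spelling of "unknown" should be canonical is not decided here.

```python
import difflib

# Accepted placeholder names (an assumed list, based on the table above).
CANONICAL = ["Anonymous", "uncredited", "unknown", "unsigned", "Various"]
_BY_LOWER = {name.lower(): name for name in CANONICAL}

def suggest_canonical(name, cutoff=0.8):
    """Return the accepted placeholder that `name` most resembles, or None."""
    lowered = name.strip().lower()
    if lowered in _BY_LOWER:
        return _BY_LOWER[lowered]          # same name apart from letter case
    close = difflib.get_close_matches(lowered, list(_BY_LOWER), n=1, cutoff=cutoff)
    return _BY_LOWER[close[0]] if close else None

for sample in ["Anonyous", "uncredit", "Not Available"]:
    print(sample, "->", suggest_canonical(sample))
```

A name with no suggestion, like "Not Available", still has to be resolved by checking the individual titles, as was done above.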
Malformed URLs
Some Author records, e.g. Tad Williams', have bad URLs in the "Web page" field, which were possibly created by the ISFDB1-to-ISFDB2 conversion. We need to find and fix them. Ahasuerus 19:07, 27 Dec 2006 (CST)
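The exact form of the bad URLs is not described here, so the following Python sketch only shows one way a first pass might look. It assumes that a usable "Web page" value is an http or https URL with a host name and no embedded spaces; the function name `looks_malformed` and the example.com samples are hypothetical.

```python
from urllib.parse import urlsplit

def looks_malformed(url):
    """Return True if the value is empty, contains a space, or lacks an http(s) scheme or a host."""
    url = url.strip()
    if not url or " " in url:
        return True
    try:
        parts = urlsplit(url)
    except ValueError:
        return True   # urlsplit rejects some inputs, e.g. unbalanced IPv6 brackets
    return parts.scheme not in ("http", "https") or not parts.netloc

for sample in ["http://www.isfdb.org/DIR_U.html", "www.example.com/page", "http:/example.com"]:
    print(sample, "->", "check" if looks_malformed(sample) else "ok")
```

A check like this only catches structural problems; a well-formed URL that points to the wrong site would still pass.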
Questionable First/Middle/Last Names
The following script was developed in a hurry and needs to be cleaned up. It assumes that "c:/ISFDB/Authors.txt" is a dump of the MySQL Author table. Ahasuerus 22:07, 9 Apr 2007 (CDT)
use strict;
my $mainfile = "c:/ISFDB/Authors.txt";
#my $space = '\s';
my $first = '[A-Z]{1}[-\'a-z]{1,25}\s';
my $last = "[A-Z]{1}[-a-z']{1,25}";
my $init = '[A-Z]{1}\. ';
my $middle = '[A-Z]{1}[-a-z]{1,20}\s';
#
open(AUTHORS,$mainfile) || die("can't open file $mainfile");
while (<AUTHORS>) {
    my $string = $_;
    # Put the suffix (anything after ",") into $suffix[1]
    my @suffix = split /,\s*/, $string;
    # Names with more than 1 comma are an error and are skipped here
    if ($#suffix > 1) {
        next;
    }
    # If there is a suffix, skip the name unless the suffix is in the approved list
    if ($#suffix == 1) {
        if (!($suffix[1] =~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/)) {
            next;
        }
    }
    $_ = $suffix[0];
    next if /^($first)($last)$/;                  #FirstName LastName
    next if /^($init)($last)$/;                   #Initial. LastName
    next if /^($init)($middle)($last)$/;          #Initial. MiddleName LastName
    next if /^($first)($init)($last)$/;           #FirstName Initial. LastName
    next if /^($first)($init)($init)($last)$/;    #FirstName Initial1. Initial2. LastName
    next if /^($init)($init)($init)($last)$/;     #Initial1. Initial2. Initial3. LastName
    next if /^($init)($init)($last)$/;            #Initial1. Initial2. LastName
    next if /^($first)($middle)($last)$/;         #FirstName MiddleName LastName
    my @word = split /\s+/;
    my $lname = $word[$#word];
    my $fname = $word[0];
    # The pattern on the following line is cut off in the source
    if ($lname =~ /
        print "$_";
        next;
    }
=pod
print "$_" if / II$/;
print "$_" if / III$/;
print "$_" if / Jr\.$/;
=cut
Here's my stab at a Python version of an author names script. It requires you to have the Python MySQL module installed so that it can retrieve the author names from the db. You'll need to edit the first few lines as appropriate for your setup.
It's short because most of the complexity has been compressed into that one big regexp. Essentially it's just flagging any name that doesn't follow a simple pattern of (several-names-or-initials optional-byname-particle lastname optional-suffix-with-comma). It flags plenty of perfectly valid names, but it narrows things down enough that it's easy to scan the list by eye. --WimLewis 02:59, 10 Apr 2007 (CDT)
# If your mysql module isn't installed in the system path, include the path here
#import sys
#sys.path.append('/Users/Shared/wiml/pmysql/MySQL-python-1.2.1_p2/build/lib.darwin-8.9.0-Power_Macintosh-2.3')

import MySQLdb
import re

conn = MySQLdb.connect(user='root', db='isfdb')
sess = conn.cursor()

sess.execute('SELECT a.author_id, a.author_canonical FROM authors a;')

# Names or initials, optional byname particle, last name, optional suffix after a comma.
# The trailing "$" makes the whole name fit the pattern, not just its beginning.
auname = re.compile(r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )[A-Z][A-Za-z]+(?:, [A-Za-z\.]+)?$")

while True:
    row = sess.fetchone()
    if row is None:
        break
    (oid, name) = row
    # Print every author whose name does not follow the simple pattern
    if not auname.match(name):
        print("%s %s" % (oid, name))
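To try the pattern without a database connection, it can be run against a handful of names from this page. The sketch below copies the regular expression from the script above and applies it with `fullmatch`, on the assumption that the whole name is meant to fit the pattern; the function name `flagged` is hypothetical.

```python
import re

AUNAME = re.compile(
    r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*"             # given names or initials
    r"(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )"     # optional byname particle
    r"[A-Z][A-Za-z]+"                             # last name
    r"(?:, [A-Za-z\.]+)?"                         # optional suffix after a comma
)

def flagged(name):
    """Return True if the whole name does not fit the simple pattern."""
    return AUNAME.fullmatch(name) is None

for sample in ["Tad Williams", "Neal Barrett, Jr.", "Richard J. O'Brien",
               "Louis, Jr. Porter", "Kathleen M. Massie-Ferch", "unknownAfghan"]:
    print(sample, "->", "flagged" if flagged(sample) else "ok")
```

"Kathleen M. Massie-Ferch" is flagged only because the last-name part does not allow a hyphen, which is an instance of the valid names the description above says will turn up in the output.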