ISFDB:Author Names Cleanup
Project Description
The Author Names Cleanup project aims to find and eliminate mistyped, duplicate, poorly formatted or otherwise erroneous Author records.
Sub-projects
Questionable Suffixes
"Authors.pl" is a Perl script which searches a flat file of all ISFDB Author names extracted from the MySQL database for unusual suffixes. "Suffixes" are defined as any characters following a comma. "Usual" suffixes are defined as:
- Sr.
- Jr.
- II
- III
- IV
- Ph.D.
- M.D.
Script
    use strict;
    use warnings;

    my $mainfile = "c:/ISFDB/Authors.txt";
    open(my $authors, '<', $mainfile) || die("can't open file $mainfile: $!");

    while (my $string = <$authors>) {
        chomp $string;
        # Put the suffix (anything after ",") into $suffix[1]
        my @suffix = split /,\s*/, $string;

        # If there is more than 1 comma, then there is an error
        if ($#suffix > 1) {
            print "$string\n";
            next;
        }

        # If there is a suffix, check if it's in the list of approved suffixes.
        # Every alternative is anchored at both ends and every dot is escaped.
        if ($#suffix == 1) {
            if ($suffix[1] !~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/) {
                print "$string\n";
            }
        }
    }
    close($authors);
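For quick experiments without the full Authors.txt dump, the same comma rule can be wrapped in a subroutine. The following is a minimal sketch, not part of Authors.pl: the name classify_name and its three return strings are made up for illustration, and the sample names are taken from the lists on this page.

```perl
use strict;
use warnings;

# Suffixes treated as "usual", taken from the list in this section.
my $USUAL_SUFFIX = qr/^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/;

# classify_name is a hypothetical helper, not part of Authors.pl.
# It returns 'ok', 'unusual suffix' or 'multiple commas' for one author name.
sub classify_name {
    my ($name) = @_;
    my @parts = split /,\s*/, $name;
    return 'multiple commas' if @parts > 2;
    return 'ok'              if @parts < 2;   # no comma, so no suffix to check
    return $parts[1] =~ $USUAL_SUFFIX ? 'ok' : 'unusual suffix';
}

for my $name ('Neal Barrett, Jr.', 'John C. Wright, Esq.',
              'Arlan Andrews, Kris Andrews, Joe Giarratano') {
    printf "%-45s %s\n", $name, classify_name($name);
}
```

Keeping the rule in one subroutine means the list of approved suffixes lives in a single place if other scripts on this page need the same check.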
Identified questionable suffixes
Results as of 09/10/06 when run against the 08/27/06 ISFDB backup file using ActivePerl/Windows XP: (solved cases moved below)
John Pierce, M.S. Arlan Keith Andrews, Sr John C. Wright, Esq. Epaminondas T. Snooks, D.T.G. Roscoe Clark, F.R.C.S. Evelyn A. Archer, P.I. Ronald V. Dorn, Jr. M.D. The 1992 James Tiptree, Jr Award Judges The 1995 James Tiptree, Jr Award Judges Arthur W. Weir, D.Sc. Zuprik-Curtis Enterprises, Inc. Universal City Studios, Inc. Ben R., Ph.D. Games J. A, Lawrence Neal Barrett, Jr Rockwell, Carey Mary, H Herbert James, White Zora, N. Hurston Francis M., Jr. Nevins Dora and MacGregor,Eleanor Pantell Stuart, Gordon Yvonne, Fern Solow Stanley Grauman, Weinbaum Walter, Jr. Wangerin Peter, Dr. Beckmann Roberta Carter, Rogers, Jacqueline Clark Charles, Waugh Arlan Andrews, Kris Andrews, Joe Giarratano William, F. Nolan W., Rev Awdry Jenifer, A. Ruth O. David, Dr West Emmett O., III Saunders Esme Nichola Author Winter, Barbara Illustrator Shilletto Riley W., Jr Sanson Hiccup Horrendous, III Haddock Mike, Jr. Deodato Louis, Jr. Porter Joseph S., Jr. Nye Rob, Jr Potchak Kenneth, Jr. Faig Jr, Bill Martin Normand, R. Bernier Wilson, Tortosa Philip Harbottle, Editor Robert S., Jr. Sanders Michael Simon Bodner, PhD Richard, J. O'brien Douglas M., Sir Price D., M. Brown Joseph, Jr Covino Lovelee, I. Dagum John, A. Hall James, Sir Knowles Mark, Edward Hall MJ Studios, cover art Jim Seward Seton Hall University, Dr. Dermot Quinn Michael, W. Perry David, Niall Wilson Jimmie E., Jr. Cain Todd, F. Davis Hugh J., Jr. Luke
Cleaned up Suffixes
- Neal, Jr. Barrett - just merged into Neal_Barrett,_Jr. There seem to remain also Neal_Barrett, Jr and Neal_Barrett although they are empty; so - merge them too?
- Mishima, Yukio - just merged into Yukio_Mishima
- Gordon, R. Dickinson - just removed from a stray pub with Gordon R. Dickson
- Chesterton scholar, Aidan Mackey - merged into Aidan_Mackey several weeks ago
- Richard Gilliam, Wendy Webb, Edward E. Kramer, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Thomas R. Hanlon; Richard Gilliam, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Kathleen M. Massie-Ferch; Janet Berliner, Uwe Luserke, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Edward E. Kramer - seem to have been split a long time ago, as they are no longer present
- Brenda, W. Clough; Michael, Moorcock - long gone
Is there a use for documenting this so meticulously, or would it be enough just to delete from the list above? --JVjr 06:50, 10 Mar 2007 (CST)
"PhD, C. Malcolm Trowbridge" changed to "C. Malcolm Trowbridge, Ph.D." and "Ph.D, Sandra Eubanks" merged with "Sandra Eubanks, Ph.D." BLongley 11:20, 31 Mar 2007 (CDT)
Suspected Duplicate Author Names
Need to develop a specification and write a script, possibly multiple scripts depending on what kinds of requirements we will come up with. Also, we will need a way of finding out whether, e.g., Nancy Farmer the popular children's author is the same person as Nancy Farmer the co-author of Update One - Federal Fisheries Management: A Guidebook to the Magnuson Fishery Conservation and Management Act (1987). Please discuss on the Talk page.
Is this the right place to list authors I think are probably duplicates? I suspect that Robert Boyer is the same as Robert H. Boyer, likewise the two Zahorskies. WimLewis 18:01, 15 Mar 2007 (CDT)
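No specification exists yet, so the following is only one possible starting point, sketched under stated assumptions: two records are treated as duplicate candidates when they share the same first and last word, ignoring case, middle names or initials, and anything after a comma. The helper names dedup_key and duplicate_candidates are hypothetical. This heuristic would pair Robert Boyer with Robert H. Boyer, but it cannot tell whether two identical names, such as the two Nancy Farmers, belong to one person, so every group it prints still needs a human check.

```perl
use strict;
use warnings;

# dedup_key is a hypothetical helper: it reduces a name to
# "firstname lastname" in lower case, ignoring any comma suffix
# and any middle names or initials.
sub dedup_key {
    my ($name) = @_;
    my ($base) = split /,\s*/, $name;    # drop ", Jr." and similar
    my @words = split ' ', $base // '';
    return '' unless @words;
    return lc "$words[0] $words[-1]";
}

# Group names by key and return only the groups with more than one member.
sub duplicate_candidates {
    my @names = @_;
    my %group;
    push @{ $group{ dedup_key($_) } }, $_ for @names;
    return grep { @$_ > 1 } map { $group{$_} } sort keys %group;
}

my @sample = ('Robert Boyer', 'Robert H. Boyer', 'Nancy Farmer', 'Tad Williams');
print join(' / ', @$_), "\n" for duplicate_candidates(@sample);
```

The key is deliberately loose, so it will also group different people who happen to share a first and last name. It is meant to produce a review list, not to merge anything automatically.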
Anonymous, uncredited, etc.
Originally compiled by Marc Kupper 14:10, 16 Nov 2006 (CST):
Author | Long | Short | Awards | Comments |
---|---|---|---|---|
Anonymous | ~150 | ~250 | 6 | |
Anonyous | 0 | 3 | 0 | Should get merged with Anonymous [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)] |
Not Available | 4 | 0 | 0 | This sounds like our "not stated" [Fixed - I checked each of the four titles and for each was able to dig up the author name. Marc Kupper 03:01, 22 Nov 2006 (CST)] |
uncredit | 0 | 1 | 0 | Should get merged with uncredited. [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)] |
uncredited | 0 | ~1,000 | 0 | |
Unknown | ~1,050 | ~550 | 172 | Used for many cover/interior art instead of "unsigned" |
unknown | ~1,050 | ~550 | 172 | Both "unknown" and "Unknown" get used but search returned same list |
unknownAfghan | 0 | 1 | 0 | No explanation in the story notes about this |
unsigned | 45 | 0 | 0 | All 45 long works are for Interior Art |
Unknown Unknown | | | | Listed on http://www.isfdb.org/DIR_U.html but not found |
Various | 69 | 0 | 0 | |
Various Authors | 2 | 0 | 0 | Should get merged into Various [both were bad entries and have been deleted. Ahasuerus] |
Malformed URLs
Some Author records, e.g. Tad Williams', have bad URLs in the "Web page" field, which were possibly created by the ISFDB1-to-ISFDB2 conversion. We need to find and fix them. Ahasuerus 19:07, 27 Dec 2006 (CST)
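A first pass at finding them could look like the sketch below. The page does not say what exactly the conversion broke, so the rules here are assumptions: a value is flagged if it contains whitespace, lacks an http:// or https:// prefix, or has no dotted host name. The helper name url_looks_malformed and the sample values are hypothetical.

```perl
use strict;
use warnings;

# url_looks_malformed is a hypothetical helper. The rules below are
# assumptions, not a description of what the conversion actually broke.
sub url_looks_malformed {
    my ($url) = @_;
    return 0 unless defined $url && length $url;        # empty field: nothing to check
    return 1 if $url =~ /\s/;                            # embedded whitespace
    return 1 unless $url =~ m{^https?://([^/:?#]+)}i;    # missing or mangled scheme
    my $host = $1;
    return 1 unless $host =~ /^[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$/;  # host is not a dotted name
    return 0;
}

# Sample values only; real input would come from the "Web page" field.
for my $url ('http://www.example.com/', 'www.example.com', 'http:/www.example.com') {
    printf "%-28s %s\n", $url, url_looks_malformed($url) ? 'check' : 'ok';
}
```

A check this strict will also flag a few unusual but valid addresses, which seems acceptable for a list that editors review by hand.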
Questionable First/Middle/Last Names
The following script was developed in a hurry and needs to be cleaned up. It assumes that "c:/ISFDB/Authors.txt" is a dump of the MySQL Author table. Ahasuerus 22:07, 9 Apr 2007 (CDT)
    use strict;

    my $mainfile = "c:/ISFDB/Authors.txt";
    #my $space = '\s';
    my $first  = '[A-Z]{1}[-\'a-z]{1,25}\s';
    my $last   = "[A-Z]{1}[-a-z']{1,25}";
    my $init   = '[A-Z]{1}\. ';
    my $middle = '[A-Z]{1}[-a-z]{1,20}\s';
    #
    open(AUTHORS,$mainfile) || die("can't open file $mainfile");
    while (<AUTHORS>) {
        my $string = $_;
        # Put the suffix (anything after ",") into $suffix[1]
        my @suffix = split /,\s*/, $string;
        # If there is more than 1 comma, then there is an error
        if ($#suffix > 1) {
            next;
        }
        # If there is a suffix, check if it's in the list of approved suffixes
        if ($#suffix == 1) {
            if (!($suffix[1] =~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/)) {
                next;
            }
        }
        $_ = $suffix[0];
        next if /^($first)($last)$/;                 # FirstName LastName
        next if /^($init)($last)$/;                  # Initial. LastName
        next if /^($init)($middle)($last)$/;         # Initial. MiddleName LastName
        next if /^($first)($init)($last)$/;          # FirstName Initial. LastName
        next if /^($first)($init)($init)($last)$/;   # FirstName Initial1. Initial2. LastName
        next if /^($init)($init)($init)($last)$/;    # Initial1. Initial2. Initial3. LastName
        next if /^($init)($init)($last)$/;           # Initial1. Initial2. LastName
        next if /^($first)($middle)($last)$/;        # FirstName MiddleName LastName
        my @word = split /\s+/;
        my $lname = $word[$#word];
        my $fname = $word[0];
        if ($lname =~ /
        print "$_";
        next;
    }
    =pod
    print "$_" if / II$/;
    print "$_" if / III$/;
    print "$_" if / Jr.$/;
    =pod
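The eight accepted name shapes in the script above can also be tried out on individual names without the dump file. The following is a sketch under assumptions: it reuses the script's building-block patterns with a single space as separator, expects names that have already been chomped and stripped of any suffix, and does not attempt to reproduce the last-name test at the end of the script. The names has_usual_shape and @shapes are hypothetical.

```perl
use strict;
use warnings;

# Building blocks taken from the script above, without trailing separators.
my $first  = qr/[A-Z][-'a-z]{1,25}/;
my $last   = qr/[A-Z][-a-z']{1,25}/;
my $init   = qr/[A-Z]\./;
my $middle = qr/[A-Z][-a-z]{1,20}/;

# One pattern per accepted shape, in the same order as the script.
my @shapes = (
    qr/^$first $last\z/,                # FirstName LastName
    qr/^$init $last\z/,                 # Initial. LastName
    qr/^$init $middle $last\z/,         # Initial. MiddleName LastName
    qr/^$first $init $last\z/,          # FirstName Initial. LastName
    qr/^$first $init $init $last\z/,    # FirstName Initial1. Initial2. LastName
    qr/^$init $init $init $last\z/,     # Initial1. Initial2. Initial3. LastName
    qr/^$init $init $last\z/,           # Initial1. Initial2. LastName
    qr/^$first $middle $last\z/,        # FirstName MiddleName LastName
);

# has_usual_shape is a hypothetical helper: true if the name matches any shape.
sub has_usual_shape {
    my ($name) = @_;
    for my $re (@shapes) {
        return 1 if $name =~ $re;
    }
    return 0;
}

for my $name ('Tad Williams', 'Gordon R. Dickson', 'MJ Studios') {
    printf "%-20s %s\n", $name, has_usual_shape($name) ? 'usual' : 'questionable';
}
```

Legitimate names with an internal capital, such as Massie-Ferch, fall outside these patterns, so a "questionable" result only means the record is worth a look.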