ISFDB:Author Names Cleanup

Project Description

The Author Names Cleanup project aims to find and eliminate mistyped, duplicate, poorly formatted or otherwise erroneous Author records.

Sub-projects

Questionable Suffixes

"Authors.pl" is a Perl script which searches a flat file of all ISFDB Author names extracted from the MySQL database for unusual suffixes. "Suffixes" are defined as any characters following a comma. "Usual" suffixes are defined as:

  • Sr.
  • Jr.
  • II
  • III
  • IV
  • Ph.D.
  • M.D.

Script

use strict;
use warnings;
my $mainfile = "c:/ISFDB/Authors.txt";
open(AUTHORS, '<', $mainfile) || die("can't open file $mainfile");
while (<AUTHORS>) {
	my $string = $_;
	# Put the suffix (anything after ",") into $suffix[1]
	my @suffix = split /,\s*/, $string;
	# if there is more than 1 comma, then there is an error
	if ($#suffix > 1) {
		print $string;
		next;
	}
	# If there is a suffix, check if it's in the list of approved suffixes
	# (every "." is escaped and the whole alternation is anchored at both ends)
	if ($#suffix == 1) {
		if ($suffix[1] !~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/) {
			print $string;
		}
	}
}
close(AUTHORS);

Identified questionable suffixes

Results as of 09/10/06 when run against the 08/27/06 ISFDB backup file using ActivePerl/Windows XP (solved cases moved below):

John Pierce, M.S.
Epaminondas T. Snooks, D.T.G.
Ronald V. Dorn, Jr. M.D.
Arthur W. Weir, D.Sc.
Zuprik-Curtis Enterprises, Inc.
Universal City Studios, Inc.
Ben R., Ph.D. Games
Neal Barrett, Jr
Yvonne, Fern Solow
Peter, Dr. Beckmann
Charles, Waugh
Arlan Andrews, Kris Andrews, Joe Giarratano
Jenifer, A. Ruth
O. David, Dr West
Riley W., Jr Sanson
Hiccup Horrendous, III Haddock
Louis, Jr. Porter
Joseph S., Jr. Nye
Rob, Jr Potchak
Jr, Bill Martin
Normand, R. Bernier
Wilson, Tortosa
Robert S., Jr. Sanders
Douglas M., Sir Price
Joseph, Jr Covino
Lovelee, I. Dagum
John, A. Hall
James, Sir Knowles
Mark, Edward Hall
MJ Studios, cover art Jim Seward
Seton Hall University, Dr. Dermot Quinn
David, Niall Wilson
Jimmie E., Jr. Cain
Todd, F. Davis
Hugh J., Jr. Luke

Cleaned up Suffixes

  1. Neal, Jr. Barrett - just merged into Neal_Barrett,_Jr. Neal_Barrett, Jr and Neal_Barrett also seem to remain, although they are empty; so - merge them too?
  2. Mishima, Yukio - just merged into Yukio_Mishima
  3. Gordon, R. Dickinson - just removed from a stray pub with Gordon R. Dickson
  4. Chesterton scholar, Aidan Mackey - merged into Aidan_Mackey several weeks ago
  5. Richard Gilliam, Wendy Webb, Edward E. Kramer, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Thomas R. Hanlon; Richard Gilliam, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Kathleen M. Massie-Ferch; Janet Berliner, Uwe Luserke, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Edward E. Kramer - seem to have been split a long time ago, as they are no longer present
  6. Brenda, W. Clough; Michael, Moorcock - long gone

Is there a use for documenting this so meticulously, or would it be enough just to delete from the list above? --JVjr 06:50, 10 Mar 2007 (CST)

"PhD, C. Malcolm Trowbridge" changed to "C. Malcolm Trowbridge, Ph.D." and "Ph.D, Sandra Eubanks" merged with "Sandra Eubanks, Ph.D." BLongley 11:20, 31 Mar 2007 (CDT)
I checked a few more: we've lost these and I suspect in these ways:
Michael, W. Perry -> Michael W. Perry
Arlan Keith Andrews, Sr -> pseudonym
John C. Wright, Esq. -> pseudonym
Roscoe Clark, F.R.C.S. -> pseudonym
Evelyn A. Archer, P.I. -> pseudonym
The 1992 James Tiptree, Jr Award Judges -> pseudonym
The 1995 James Tiptree, Jr Award Judges -> pseudonym
J. A, Lawrence -> J. A. Lawrence
Rockwell, Carey -> Carey Rockwell
Mary, H Herbert -> Mary H. Herbert
James, White -> James White
Zora, N. Hurston -> Zora Neale Hurston
Francis M., Jr. Nevins -> Francis M. Nevins, Jr.
Dora and MacGregor,Eleanor Pantell -> Dora Pantell, Ellen MacGregor
Stuart, Gordon -> Stuart Gordon
Stanley Grauman, Weinbaum -> Stanley Grauman Weinbaum, pseudonym
Walter, Jr. Wangerin -> Walter Wangerin, Jr.
Roberta Carter, Rogers, Jacqueline Clark -> Roberta Carter Clark, Jacqueline Rogers
William, F. Nolan -> William F. Nolan
W., Rev Awdry -> Reverend W. Awdry
Emmett O., III Saunders -> Emmett O. Saunders, III
Esme Nichola Author Winter, Barbara Illustrator Shilletto -> Esme Nichola Shilletto, Barbara Winter
Mike, Jr. Deodato -> Gone?
Kenneth, Jr. Faig -> Kenneth W. Faig Jr.
Philip Harbottle, Editor -> Philip Harbottle
Michael Simon Bodner, PhD -> Michael Simon Bodner, Ph.D.
Richard, J. O'brien -> Richard J. O'Brien
D., M. Brown -> D. M. Brown
BLongley 16:52, 12 Apr 2007 (CDT)
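
Many of the fixes listed above follow one pattern: a generational suffix that ended up in the middle of the name is moved to the end ("Francis M., Jr. Nevins" -> "Francis M. Nevins, Jr."). A minimal Python sketch of that one transformation is below. The function name `relocate_suffix` and the `MOVABLE` list are hypothetical and not part of any ISFDB script; the sketch only proposes a corrected form and leaves the actual merge to a human editor.

```python
import re

# Assumption: only generational suffixes are relocated; degrees such as
# "Ph.D." or "M.D." are left for manual review.
MOVABLE = ("Jr.", "Jr", "Sr.", "Sr", "II", "III", "IV")

def relocate_suffix(name):
    """Turn 'First, Jr. Last' into 'First Last, Jr.'.

    Returns None when the name does not have exactly that shape.
    """
    parts = name.split(", ")
    if len(parts) != 2:
        return None  # no comma, or more than one comma
    head, tail = parts
    words = tail.split(" ")
    if len(words) < 2 or words[0] not in MOVABLE:
        return None  # suffix is already last, or is not a known suffix
    rest = " ".join(words[1:])
    # Require a single plain surname after the suffix, so that
    # "Jr. M.D." is not mistaken for "suffix + surname".
    if not re.fullmatch(r"[A-Z][A-Za-z'-]+", rest):
        return None
    return "%s %s, %s" % (head, rest, words[0])

print(relocate_suffix("Francis M., Jr. Nevins"))
print(relocate_suffix("Emmett O., III Saunders"))
```

The suffix is copied as found, so "Riley W., Jr Sanson" would come back with "Jr" still missing its period; normalizing the suffix spelling would be a separate step.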

Suspected Duplicate Author Names

We need to develop a specification and write a script, or possibly several scripts, depending on the requirements we settle on. We will also need a way of determining whether, e.g., Nancy Farmer the popular children's author is the same person as Nancy Farmer the co-author of Update One - Federal Fisheries Management: A Guidebook to the Magnuson Fishery Conservation and Management Act (1987). Please discuss on the Talk page.


Is this the right place to list authors I think are probably duplicates? I suspect that Robert Boyer is the same as Robert H. Boyer, likewise the two Zahorskies. WimLewis 18:01, 15 Mar 2007 (CDT)
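
As a starting point for that discussion, here is a rough Python sketch of one possible check: group names that share a first and last word once middle names, initials and comma suffixes are ignored. This would pair "Robert Boyer" with "Robert H. Boyer". The functions `name_key` and `duplicate_candidates` are hypothetical, and the grouping rule is an assumption rather than an agreed specification.

```python
from collections import defaultdict

def name_key(name):
    """Reduce a name to (first word, last word), lower-cased.

    Anything after a comma (a suffix such as ", Jr.") is ignored.
    Single-word names yield None and are skipped.
    """
    base = name.split(",")[0]
    words = base.split()
    if len(words) < 2:
        return None
    return (words[0].lower(), words[-1].lower())

def duplicate_candidates(names):
    """Return groups of two or more names that share the same key."""
    groups = defaultdict(list)
    for name in names:
        key = name_key(name)
        if key is not None:
            groups[key].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]

sample = ["Robert Boyer", "Robert H. Boyer", "Nancy Farmer", "James White"]
for group in duplicate_candidates(sample):
    print(" / ".join(group))
```

This only produces candidates for review. It cannot settle cases like the two Nancy Farmers, where the names are identical and only outside information can tell the people apart.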

Anonymous, uncredited, etc.

Originally compiled by Marc Kupper 14:10, 16 Nov 2006 (CST):

Author | Long | Short | Awards | Comments
Anonymous | ~150 | ~250 | 6 |
Anonyous | 0 | 3 | 0 | Should get merged with Anonymous [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)]
Not Available | 4 | 0 | 0 | This sounds like our "not stated" [Fixed - I checked each of the four titles and for each was able to dig up the author name. Marc Kupper 03:01, 22 Nov 2006 (CST)]
uncredit | 0 | 1 | 0 | Should get merged with uncredited. [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)]
uncredited | 0 | ~1,000 | 0 |
Unknown | ~1,050 | ~550 | 172 | Used for many cover/interior art instead of "unsigned"
unknown | ~1,050 | ~550 | 172 | Both "unknown" and "Unknown" get used but search returned same list
unknownAfghan | 0 | 1 | 0 | No explanation in the story notes about this
unsigned | 45 | 0 | 0 | All 45 long works are for Interior Art
Unknown Unknown | | | | Listed on http://www.isfdb.org/DIR_U.html but not found
Various | 69 | 0 | 0 |
Various Authors | 2 | 0 | 0 | Should get merged into Various [both were bad entries and have been deleted. Ahasuerus]
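
Misspelled placeholders like "Anonyous" and "uncredit" could also be found automatically. The Python sketch below uses the standard library's `difflib.get_close_matches` to suggest a merge target for a name that is close to, but not the same as, one of the placeholder names. The `CANONICAL` list, the function `suggest_merge` and the 0.85 cutoff are assumptions chosen for illustration, not agreed project settings.

```python
import difflib

# Assumption: these are the placeholder names we want to keep.
CANONICAL = ["Anonymous", "uncredited", "unknown", "unsigned", "Various"]

def suggest_merge(name, cutoff=0.85):
    """Return the canonical placeholder that `name` probably misspells.

    Returns None if `name` already is a canonical placeholder
    (ignoring case) or is not close to any of them.
    """
    lowered = {c.lower(): c for c in CANONICAL}
    key = name.lower()
    if key in lowered:
        return None
    hits = difflib.get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

for candidate in ["Anonyous", "uncredit", "Unknown", "Various Authors"]:
    print(candidate, "->", suggest_merge(candidate))
```

With this cutoff, "Various Authors" is not close enough to "Various" to be suggested, so longer variants like that one would still need a manual pass.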

Malformed URLs

Some Author records, e.g. Tad Williams', have bad URLs in the "Web page" field, which were possibly created by the ISFDB1-to-ISFDB2 conversion. We need to find and fix them. Ahasuerus 19:07, 27 Dec 2006 (CST)
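
The page does not say what the bad URLs look like, so the following Python sketch is only a guess at a first-pass filter. It flags values that contain whitespace, lack an http/https scheme, or have no dotted host name. The function name `looks_malformed` and all three criteria are assumptions; anything it flags would still need checking by hand.

```python
from urllib.parse import urlsplit

def looks_malformed(url):
    """Heuristic check of a "Web page" value; True means "review this"."""
    if not url:
        return False  # an empty field is not an error
    if any(ch.isspace() for ch in url):
        return True   # embedded spaces, tabs or line breaks
    try:
        parts = urlsplit(url)
    except ValueError:
        return True   # the standard library could not parse it at all
    if parts.scheme not in ("http", "https"):
        return True   # e.g. "www.isfdb.org" with no scheme
    return "." not in parts.netloc  # e.g. "http:/www.isfdb.org"

for value in ["http://www.isfdb.org/DIR_U.html", "www.isfdb.org", "http:/www.isfdb.org"]:
    print(value, looks_malformed(value))
```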

Questionable First/Middle/Last Names

The following script was developed in a hurry and needs to be cleaned up. It assumes that "c:/ISFDB/Authors.txt" is a dump of the MySQL Author table. Ahasuerus 22:07, 9 Apr 2007 (CDT)

use strict;
my $mainfile = "c:/ISFDB/Authors.txt";
#my $space = '\s';
my $first = '[A-Z]{1}[-\'a-z]{1,25}\s';
my $last = "[A-Z]{1}[-a-z']{1,25}";
my $init = '[A-Z]{1}\. ';
my $middle = '[A-Z]{1}[-a-z]{1,20}\s';
#
open(AUTHORS,$mainfile) || die("can't open file $mainfile");
while (<AUTHORS>) {
	my $string = $_;
	# Put the suffix (anything after ",") into the $suffix[1]
	my @suffix = split /,\s*/, $string;
	# if there is more than 1 comma, then there is an error
	if ($#suffix > 1) {
		next;
	}
	# If there is a suffix, check if it's in the list of approved suffixes
	if ($#suffix == 1) {
		if (!($suffix[1] =~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/)) {
			next;
		}
	}
	$_ = $suffix[0];
	next if /^($first)($last)$/; #FirstName LastName
	next if /^($init)($last)$/; #Initial. LastName
	next if /^($init)($middle)($last)$/; #Initial. MiddleName LastName
	next if /^($first)($init)($last)$/; #FirstName Initial. LastName
	next if /^($first)($init)($init)($last)$/; #FirstName Initial1. Initial2. LastName
	next if /^($init)($init)($init)($last)$/; #Initial1. Initial2. Initial3. LastName
	next if /^($init)($init)($last)$/; #Initial1. Initial2. LastName
	next if /^($first)($middle)($last)$/; #FirstName MiddleName LastName
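	my @word = split /\s+/;
	my $lname = $word[$#word];
	my $fname = $word[0];
	# NOTE: the script as posted breaks off here with an unfinished test,
	# "if ($lname =~ /". The intended pattern is not known, so the test is left out.
	chomp;	# $_ still carries its newline when the name had no suffix
	print "$_\n";
	next;
}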
=pod
	print "$_" if / II$/;
	print "$_" if / III$/;
	print "$_" if / Jr\.$/;
=cut


Here's my stab at a Python version of an author names script. It requires you to have the Python MySQL module installed so that it can retrieve the author names from the db. You'll need to edit the first few lines as appropriate for your setup.

It's short because most of the complexity has been compressed into that one big regexp. Essentially it's just flagging any name that doesn't follow a simple pattern of (several-names-or-initials optional-byname-particle lastname optional-suffix-with-comma). It flags plenty of perfectly valid names, but it narrows things down enough that it's easy to scan the list by eye. --WimLewis 02:59, 10 Apr 2007 (CDT)

# If your mysql module isn't installed in the system path, include the path here
#import sys
#sys.path.append('/Users/Shared/wiml/pmysql/MySQL-python-1.2.1_p2/build/lib.darwin-8.9.0-Power_Macintosh-2.3')

import MySQLdb
import re

conn = MySQLdb.connect( user='root', db='isfdb' )
sess = conn.cursor()

sess.execute('SELECT a.author_id, a.author_canonical FROM authors a;')

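# A raw string avoids backslash escaping; the trailing "$" makes the pattern
# cover the whole name (match() on its own only anchors at the start).
auname = re.compile(r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )[A-Z][A-Za-z]+(?:, [A-Za-z\.]+)?$")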

while True:
    row = sess.fetchone()
    if row is None:
        break
    (oid, name) = row
    if not auname.match(name):
        # A single formatted string prints the same on Python 2 and Python 3
        print("%s %s" % (oid, name))
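
To try the pattern without a database connection, the same regular expression can be run against a handful of names taken from the lists on this page. This is only a sketch: the helper `flagged` and the `SAMPLE` list are made up for the example, and `re.fullmatch` is used so that the whole name has to fit the pattern.

```python
import re

# Same pattern as in the script above, split across lines for readability.
AUNAME = re.compile(
    r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*"           # first names and initials
    r"(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )"  # optional particle
    r"[A-Z][A-Za-z]+"                          # last name
    r"(?:, [A-Za-z\.]+)?"                      # optional ", suffix"
)

def flagged(names):
    """Return the names that do not fit the pattern as a whole."""
    return [n for n in names if not AUNAME.fullmatch(n)]

SAMPLE = [
    "William F. Nolan",
    "Neal Barrett, Jr.",
    "William, F. Nolan",
    "Richard J. O'Brien",
    "Zuprik-Curtis Enterprises, Inc.",
    "Mishima, Yukio",
]

for name in flagged(SAMPLE):
    print(name)
```

Of these six, only "William, F. Nolan" and "Zuprik-Curtis Enterprises, Inc." are flagged. An inverted name such as "Mishima, Yukio" slips through because the part after the comma looks like a suffix to this pattern, so the suffix check from the Questionable Suffixes section is still needed alongside it.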
