Difference between revisions of "ISFDB:Author Names Cleanup"

Latest revision as of 17:56, 7 August 2009

Project Description

The Author Names Cleanup project aims to find and eliminate mistyped, duplicate, poorly formatted or otherwise erroneous Author records.

Sub-projects

Questionable Suffixes

"Authors.pl" is a Perl script which searches a flat file of all ISFDB Author names extracted from the MySQL database for unusual suffixes. "Suffixes" are defined as any characters following a comma. "Usual" suffixes are defined as:

  • Sr.
  • Jr.
  • II
  • III
  • IV
  • Ph.D.
  • M.D.

Script

use strict;
use warnings;

# Flat file with one ISFDB author name per line
my $mainfile = "c:/ISFDB/Authors.txt";
open(my $authors, '<', $mainfile) || die("can't open file $mainfile: $!");
while (my $string = <$authors>) {
	# Put the suffix (anything after ",") into $suffix[1]
	my @suffix = split /,\s*/, $string;
	# If there is more than 1 comma, then there is an error
	if ($#suffix > 1) {
		print $string;
		next;
	}
	# If there is a suffix, check if it's in the list of approved suffixes.
	# Every alternative is anchored and every literal dot is escaped.
	if ($#suffix == 1) {
		if ($suffix[1] !~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/) {
			print $string;
		}
	}
}
close($authors);
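
For readers who prefer Python, the same suffix check can be sketched as below. This is only an illustration of the rule described above (split on commas, accept one approved suffix), not a replacement for Authors.pl; the function name is_questionable and the sample names are chosen for the example.

```python
import re

# Approved suffixes, taken from the list in this section.
APPROVED = re.compile(r"(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)")


def is_questionable(name):
    """Return True if the name has more than one comma or an unapproved suffix."""
    parts = re.split(r",\s*", name.strip())
    if len(parts) > 2:
        # More than one comma is always treated as an error.
        return True
    if len(parts) == 2:
        # Exactly one comma: the text after it must be an approved suffix.
        return APPROVED.fullmatch(parts[1]) is None
    # No comma at all: nothing to check.
    return False


if __name__ == "__main__":
    for sample in ["Neal Barrett, Jr.", "Neal Barrett, Jr", "Louis, Jr. Porter", "James White"]:
        print(sample, "->", "questionable" if is_questionable(sample) else "ok")
```

Using fullmatch means the whole suffix has to be one of the approved forms, so "Jr" without the period and "Jr. Porter" are both reported.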

Identified questionable suffixes

Results as of 09/10/06 when run against the 08/27/06 ISFDB backup file using ActivePerl/Windows XP (solved cases moved below):

John Pierce, M.S.
Epaminondas T. Snooks, D.T.G.
Ronald V. Dorn, Jr. M.D.
Arthur W. Weir, D.Sc.
Zuprik-Curtis Enterprises, Inc.
Universal City Studios, Inc.
Ben R., Ph.D. Games
Neal Barrett, Jr
Yvonne, Fern Solow
Peter, Dr. Beckmann
Charles, Waugh
Arlan Andrews, Kris Andrews, Joe Giarratano
Jenifer, A. Ruth
O. David, Dr West
Riley W., Jr Sanson
Hiccup Horrendous, III Haddock
Louis, Jr. Porter
Joseph S., Jr. Nye
Rob, Jr Potchak
Jr, Bill Martin
Normand, R. Bernier
Wilson, Tortosa
Robert S., Jr. Sanders
Douglas M., Sir Price
Joseph, Jr Covino
Lovelee, I. Dagum
John, A. Hall
James, Sir Knowles
Mark, Edward Hall
MJ Studios, cover art Jim Seward
Seton Hall University, Dr. Dermot Quinn
David, Niall Wilson
Jimmie E., Jr. Cain
Todd, F. Davis
Hugh J., Jr. Luke

Cleaned up Suffixes

  1. Neal, Jr. Barrett - just merged into Neal_Barrett,_Jr. There seem to remain also Neal_Barrett, Jr and Neal_Barrett although they are empty; so - merge them too?
  2. Mishima, Yukio - just merged into Yukio_Mishima
  3. Gordon, R. Dickinson - just removed from a stray pub with Gordon R. Dickson
  4. Chesterton scholar, Aidan Mackey - merged into Aidan_Mackey several weeks ago
  5. Richard Gilliam, Wendy Webb, Edward E. Kramer, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Thomas R. Hanlon; Richard Gilliam, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Kathleen M. Massie-Ferch; Janet Berliner, Uwe Luserke, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Edward E. Kramer - seem to have been split a long time ago, as they are no longer present
  6. Brenda, W. Clough; Michael, Moorcock - long gone

Is there a use for documenting this so meticulously, or would it be enough just to delete from the list above? --JVjr 06:50, 10 Mar 2007 (CST)

"PhD, C. Malcolm Trowbridge" changed to "C. Malcolm Trowbridge, Ph.D." and "Ph.D, Sandra Eubanks" merged with "Sandra Eubanks, Ph.D." BLongley 11:20, 31 Mar 2007 (CDT)
I checked a few more: we've lost these and I suspect in these ways:
Michael, W. Perry -> Michael W. Perry
Arlan Keith Andrews, Sr -> pseudonym
John C. Wright, Esq. -> pseudonym
Roscoe Clark, F.R.C.S. -> pseudonym
Evelyn A. Archer, P.I. -> pseudonym
The 1992 James Tiptree, Jr Award Judges -> pseudonym
The 1995 James Tiptree, Jr Award Judges -> pseudonym
J. A, Lawrence -> J. A. Lawrence
Rockwell, Carey -> Carey Rockwell
Mary, H Herbert -> Mary H. Herbert
James, White -> James White
Zora, N. Hurston -> Zora Neale Hurston
Francis M., Jr. Nevins -> Francis M. Nevins, Jr.
Dora and MacGregor,Eleanor Pantell -> Dora Pantell, Ellen MacGregor
Stuart, Gordon -> Stuart Gordon
Stanley Grauman, Weinbaum -> Stanley Grauman Weinbaum, pseudonym
Walter, Jr. Wangerin -> Walter Wangerin, Jr.
Roberta Carter, Rogers, Jacqueline Clark -> Roberta Carter Clark, Jacqueline Rogers
William, F. Nolan -> William F. Nolan
W., Rev Awdry -> Reverend W. Awdry
Emmett O., III Saunders -> Emmett O. Saunders, III
Esme Nichola Author Winter, Barbara Illustrator Shilletto -> Esme Nichola Shilletto, Barbara Winter
Mike, Jr. Deodato -> Gone?
Kenneth, Jr. Faig -> Kenneth W. Faig Jr.
Philip Harbottle, Editor -> Philip Harbottle
Michael Simon Bodner, PhD -> Michael Simon Bodner, Ph.D.
Richard, J. O'brien -> Richard J. O'Brien
D., M. Brown -> D. M. Brown
BLongley 16:52, 12 Apr 2007 (CDT)

Suspected Duplicate Author Names

We need to develop a specification and write a script, possibly several scripts, depending on the requirements we come up with. We will also need a way of finding out whether, e.g., Nancy Farmer the popular children's author is the same person as Nancy Farmer the co-author of Update One - Federal Fisheries Management: A Guidebook to the Magnuson Fishery Conservation and Management Act (1987). Please discuss on the Talk page.


Is this the right place to list authors I think are probably duplicates? I suspect that Robert Boyer is the same as Robert H. Boyer, likewise the two Zahorskies. WimLewis 18:01, 15 Mar 2007 (CDT)
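
As a starting point for the specification, here is a minimal sketch of one possible approach: group names that become identical once initials and comma suffixes are ignored. The normalisation rule and the function names (name_key, candidate_duplicates) are assumptions made for illustration; nothing like this has been agreed on the Talk page.

```python
import re
from collections import defaultdict


def name_key(name):
    """Hypothetical normalisation: drop any comma suffix and single-letter
    initials (with or without a period), then lowercase what is left."""
    base = name.split(",")[0]
    words = [w for w in base.split() if not re.fullmatch(r"[A-Z]\.?", w)]
    return " ".join(words).lower()


def candidate_duplicates(names):
    """Return groups of distinct author names that share the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = ["Robert Boyer", "Robert H. Boyer", "Nancy Farmer"]
    for group in candidate_duplicates(sample):
        print(" / ".join(group))
```

A script like this can only propose pairs such as Robert Boyer and Robert H. Boyer for a human to review. It cannot answer the Nancy Farmer question, because two different people with exactly the same name share a single Author record and never show up as a pair.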

Anonymous, uncredited, etc.

Originally compiled by Marc Kupper 14:10, 16 Nov 2006 (CST):

Author           Long    Short   Awards  Comments
Anonymous        ~150    ~250    6
Anonyous         0       3       0       Should get merged with Anonymous [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)]
Not Available    4       0       0       This sounds like our "not stated" [Fixed - I checked each of the four titles and for each was able to dig up the author name. Marc Kupper 03:01, 22 Nov 2006 (CST)]
uncredit         0       1       0       Should get merged with uncredited. [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)]
uncredited       0       ~1,000  0
Unknown          ~1,050  ~550    172     Used for many cover/interior art instead of "unsigned"
unknown          ~1,050  ~550    172     Both "unknown" and "Unknown" get used but search returned same list
unknownAfghan    0       1       0       No explanation in the story notes about this
unsigned         45      0       0       All 45 long works are for Interior Art
Unknown Unknown                          Listed on http://www.isfdb.org/DIR_U.html but not found
Various          69      0       0
Various Authors  2       0       0       Should get merged into Various [both were bad entries and have been deleted. Ahasuerus]
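
Misspelled or differently capitalised placeholder names such as "Anonyous", "uncredit" and "Unknown" could also be found automatically. The sketch below uses Python's standard difflib module. The list of canonical placeholders and the 0.8 similarity cutoff are assumptions picked for this example, not project decisions.

```python
import difflib

# Assumed canonical spellings; which form is actually preferred
# (e.g. "unknown" vs "Unknown") is for the project to decide.
PLACEHOLDERS = ["Anonymous", "uncredited", "unknown", "unsigned", "Various"]


def closest_placeholder(name, cutoff=0.8):
    """Return the canonical placeholder that `name` nearly matches,
    or None if it is already canonical or not similar to any of them."""
    by_lower = {p.lower(): p for p in PLACEHOLDERS}
    hits = difflib.get_close_matches(name.lower(), list(by_lower), n=1, cutoff=cutoff)
    if hits and name != by_lower[hits[0]]:
        return by_lower[hits[0]]
    return None


if __name__ == "__main__":
    for candidate in ["Anonyous", "uncredit", "Unknown", "unknown", "Various Authors"]:
        print(candidate, "->", closest_placeholder(candidate))
```

Longer variants such as "Various Authors" fall below this cutoff, so a real run would probably need a lower cutoff or an explicit list of known variants.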

Malformed URLs

Some Author records, e.g. Tad Williams', have bad URLs in the "Web page" field, which were possibly created by the ISFDB1-to-ISFDB2 conversion. We need to find and fix them. Ahasuerus 19:07, 27 Dec 2006 (CST)
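
The note above does not say what the bad URLs look like, so the following is only a sketch under an assumed definition: a value counts as malformed if it contains whitespace, lacks an http or https scheme, or has no host part. The function name looks_malformed is hypothetical, and the checks would need adjusting once real examples from the conversion have been examined.

```python
from urllib.parse import urlsplit


def looks_malformed(url):
    """Assumed definition of a bad "Web page" value: empty, containing
    whitespace, not http/https, or missing the host part."""
    url = url.strip()
    if not url or any(ch.isspace() for ch in url):
        return True
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some inputs outright, e.g. unbalanced brackets.
        return True
    return parts.scheme not in ("http", "https") or not parts.netloc


if __name__ == "__main__":
    for value in ["http://www.isfdb.org/DIR_U.html", "www.isfdb.org", "http//www.isfdb.org"]:
        print(value, "->", "check" if looks_malformed(value) else "ok")
```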

Questionable First/Middle/Last Names

The following script was developed in a hurry and needs to be cleaned up. It assumes that "c:/ISFDB/Authors.txt" is a dump of the MySQL Author table. Ahasuerus 22:07, 9 Apr 2007 (CDT)

use strict;
use warnings;

my $mainfile = "c:/ISFDB/Authors.txt";
# Building blocks for the accepted name patterns
my $first = '[A-Z][-\'a-z]{1,25}\s';
my $last = "[A-Z][-a-z']{1,25}";
my $init = '[A-Z]\. ';
my $middle = '[A-Z][-a-z]{1,20}\s';

open(my $authors, '<', $mainfile) || die("can't open file $mainfile: $!");
while (my $string = <$authors>) {
	chomp $string;
	next unless length $string;
	# Put the suffix (anything after ",") into $suffix[1]
	my @suffix = split /,\s*/, $string;
	# If there is more than 1 comma, then there is an error; skip the name here
	next if $#suffix > 1;
	# If there is a suffix, skip the name unless the suffix is in the list of approved suffixes
	if ($#suffix == 1) {
		next if $suffix[1] !~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/;
	}
	$_ = $suffix[0];
	next if /^($first)($last)$/; #FirstName LastName
	next if /^($init)($last)$/; #Initial. LastName
	next if /^($init)($middle)($last)$/; #Initial. MiddleName LastName
	next if /^($first)($init)($last)$/; #FirstName Initial. LastName
	next if /^($first)($init)($init)($last)$/; #FirstName Initial1. Initial2. LastName
	next if /^($init)($init)($init)($last)$/; #Initial1. Initial2. Initial3. LastName
	next if /^($init)($init)($last)$/; #Initial1. Initial2. LastName
	next if /^($first)($middle)($last)$/; #FirstName MiddleName LastName
	my @word = split /\s+/;
	my $lname = $word[$#word];
	my $fname = $word[0];
	# The condition that followed here is cut off in the page text ("if ($lname =~ /"),
	# so every name that reaches this point is printed (without its suffix).
	print "$_\n";
}
close($authors);

=pod
	print "$_" if / II$/;
	print "$_" if / III$/;
	print "$_" if / Jr.$/;
=cut


Here's my stab at a Python version of an author names script. It requires you to have the Python MySQL module installed so that it can retrieve the author names from the db. You'll need to edit the first few lines as appropriate for your setup.

It's short because most of the complexity has been compressed into that one big regexp. Essentially it's just flagging any name that doesn't follow a simple pattern of (several-names-or-initials optional-byname-particle lastname optional-suffix-with-comma). It flags plenty of perfectly valid names, but it narrows things down enough that it's easy to scan the list by eye. --WimLewis 02:59, 10 Apr 2007 (CDT)

# If your mysql module isn't installed in the system path, include the path here
#import sys
#sys.path.append('/Users/Shared/wiml/pmysql/MySQL-python-1.2.1_p2/build/lib.darwin-8.9.0-Power_Macintosh-2.3')

import MySQLdb
import re

conn = MySQLdb.connect( user='root', db='isfdb' )
sess = conn.cursor()

sess.execute('SELECT a.author_id, a.author_canonical FROM authors a;')

# Raw string, so the backslashes reach the regexp engine unchanged
auname = re.compile(r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )[A-Z][A-Za-z]+(?:, [A-Za-z\.]+)?")

# Print every author whose name does not begin with the expected pattern.
# Note that match() anchors at the start of the name only; trailing text is not checked.
while True:
    row = sess.fetchone()
    if row is None:
        break
    (oid, name) = row
    if not auname.match(name):
        print(oid, name)
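
The behaviour of the regexp can be tried without a database. The sketch below applies the same pattern to a few names taken from this page and compares match(), which only anchors at the start of the name, with fullmatch(), which requires the whole name to fit the pattern. Whether whole-name matching was intended is not stated above, so the whole_name switch is an assumption added for comparison.

```python
import re

# Same pattern as in the script above, written as a raw string.
AUNAME = re.compile(
    r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*"
    r"(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )"
    r"[A-Z][A-Za-z]+(?:, [A-Za-z\.]+)?"
)


def is_flagged(name, whole_name=False):
    """Return True if the name would be reported.

    whole_name=False mirrors match() as used in the script;
    whole_name=True uses fullmatch() so trailing text is checked too."""
    matcher = AUNAME.fullmatch if whole_name else AUNAME.match
    return matcher(name) is None


if __name__ == "__main__":
    samples = ["William F. Nolan", "Walter Wangerin, Jr.", "Michael, W. Perry", "uncredited"]
    for name in samples:
        print(name, is_flagged(name), is_flagged(name, whole_name=True))
```

With match(), a name like "Michael, W. Perry" passes because its beginning ("Michael, W.") already fits the pattern; with fullmatch() it is reported. The stricter variant therefore catches misplaced suffixes and initials, at the cost of flagging more valid names.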