Difference between revisions of "ISFDB:Author Names Cleanup"

From ISFDB
(→‎Questionable Suffixes: cleaning up Barrett jr. and others seem to have been long time ago - should this be documented meticulously, or just deleted?)
Revision as of 19:01, 15 March 2007

Project Description

The Author Names Cleanup project aims to find and eliminate mistyped, duplicate, poorly formatted or otherwise erroneous Author records.

Sub-projects

Questionable Suffixes

"Authors.pl" is a Perl script which searches a flat file of all ISFDB Author names extracted from the MySQL database for unusual suffixes. "Suffixes" are defined as any characters following a comma. "Usual" suffixes are defined as:

  • Sr.
  • Jr.
  • II
  • III
  • IV
  • Ph.D.
  • M.D.

Script

use strict;
use warnings;

# Flat file with one ISFDB Author name per line, extracted from the MySQL database
my $mainfile = "c:/ISFDB/Authors.txt";
open(my $authors, '<', $mainfile) || die("can't open file $mainfile: $!");
while (my $string = <$authors>) {
	chomp $string;
	# Split on commas; the suffix (anything after ",") lands in $suffix[1]
	my @suffix = split /,\s*/, $string;
	# If there is more than 1 comma, then there is an error
	if ($#suffix > 1) {
		print "$string\n";
		next;
	}
	# If there is a suffix, check if it's in the list of approved suffixes.
	# Every alternative is anchored and every literal dot is escaped.
	if ($#suffix == 1) {
		if ($suffix[1] !~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/) {
			print "$string\n";
		}
	}
}
close($authors);

Identified questionable suffixes

Results as of 09/10/06 when run against the 08/27/06 ISFDB backup file using ActivePerl/Windows XP: (solved cases moved below)

John Pierce, M.S.
Arlan Keith Andrews, Sr
John C. Wright, Esq.
Epaminondas T. Snooks, D.T.G.
Roscoe Clark, F.R.C.S.
Evelyn A. Archer, P.I.
Ronald V. Dorn, Jr. M.D.
The 1992 James Tiptree, Jr Award Judges
The 1995 James Tiptree, Jr Award Judges
Arthur W. Weir, D.Sc.
Zuprik-Curtis Enterprises, Inc.
Universal City Studios, Inc.
Ben R., Ph.D. Games
J. A, Lawrence
Neal Barrett, Jr
Rockwell, Carey
Mary, H Herbert
James, White
Zora, N. Hurston
Francis M., Jr. Nevins
Dora and MacGregor,Eleanor Pantell
Stuart, Gordon
Yvonne, Fern Solow
Stanley Grauman, Weinbaum
Walter, Jr. Wangerin
Peter, Dr. Beckmann
Roberta Carter, Rogers, Jacqueline Clark
Charles, Waugh
Arlan Andrews, Kris Andrews, Joe Giarratano
William, F. Nolan
W., Rev Awdry
Jenifer, A. Ruth
O. David, Dr West
Emmett O., III Saunders
Esme Nichola Author Winter, Barbara Illustrator Shilletto
Riley W., Jr Sanson
Hiccup Horrendous, III Haddock
Mike, Jr. Deodato
Louis, Jr. Porter
Ph.D, Sandra  Eubanks
Joseph S., Jr. Nye
Rob, Jr Potchak
Kenneth, Jr. Faig
Jr, Bill Martin
Normand, R. Bernier
Wilson, Tortosa
PhD, C. Malcolm Trowbridge
Philip Harbottle, Editor
Robert S., Jr. Sanders
Michael Simon Bodner, PhD
Richard, J. O'brien
Douglas M., Sir Price
D., M. Brown
Joseph, Jr Covino
Lovelee, I. Dagum
John, A. Hall
James, Sir Knowles
Mark, Edward Hall
MJ Studios, cover art Jim Seward
Seton Hall University, Dr. Dermot Quinn
Michael, W. Perry
David, Niall Wilson
Jimmie E., Jr. Cain
Todd, F. Davis
Hugh J., Jr. Luke

Cleaned up Suffixes

  1. Neal, Jr. Barrett - just merged into Neal_Barrett,_Jr. There seem to remain also Neal_Barrett, Jr and Neal_Barrett although they are empty; so - merge them too?
  2. Mishima, Yukio - just merged into Yukio_Mishima
  3. Gordon, R. Dickinson - just removed from a stray pub with Gordon R. Dickson
  4. Chesterton scholar, Aidan Mackey - merged into Aidan_Mackey several weeks ago
  5. Richard Gilliam, Wendy Webb, Edward E. Kramer, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Thomas R. Hanlon; Richard Gilliam, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Kathleen M. Massie-Ferch; Janet Berliner, Uwe Luserke, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Edward E. Kramer - seem to have been split a long time ago, as they are no longer present
  6. Brenda, W. Clough; Michael, Moorcock - long gone

Is there a use for documenting this so meticulously, or would it be enough just to delete from the list above? --JVjr 06:50, 10 Mar 2007 (CST)

Suspected Duplicate Author Names

Need to develop a specification and write a script, possibly multiple scripts depending on what kinds of requirements we will come up with. Also, we will need a way of finding out whether, e.g., Nancy Farmer the popular children's author is the same person as Nancy Farmer the co-author of Update One - Federal Fisheries Management: A Guidebook to the Magnuson Fishery Conservation and Management Act (1987). Please discuss on the Talk page.


Is this the right place to list authors I think are probably duplicates? I suspect that Robert Boyer is the same as Robert H. Boyer, likewise the two Zahorskies. WimLewis 18:01, 15 Mar 2007 (CDT)
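
Until a specification exists, here is one rough sketch of what a first-pass script could look like. It is written in Python rather than the Perl used above, purely for illustration. It groups names that become identical once middle initials are dropped, which would pair Robert Boyer with Robert H. Boyer. The function names `name_key` and `candidate_duplicates` and the sample list are made up for this example. It only produces candidates for a human to check, and it cannot tell whether two identical names (the Nancy Farmer case) belong to one person or two.

```python
import re
from collections import defaultdict


def name_key(name):
    """Crude comparison key: lowercase the name and drop middle initials."""
    tokens = name.replace(",", " ").split()
    kept = []
    for i, tok in enumerate(tokens):
        is_initial = re.fullmatch(r"[A-Za-z]\.?", tok) is not None
        # Drop single-letter initials unless they are the first or last token
        if is_initial and 0 < i < len(tokens) - 1:
            continue
        kept.append(tok.lower())
    return " ".join(kept)


def candidate_duplicates(names):
    """Return groups of distinct spellings that share the same key."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample; a real run would read the full Author list
sample = ["Robert Boyer", "Robert H. Boyer", "Nancy Farmer", "Gordon R. Dickson"]
for group in candidate_duplicates(sample):
    print(" / ".join(group))
```

Dropping only middle initials keeps the key conservative. A looser key (for example, surname only) would catch more pairs but would also bury the list in false positives.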

Anonymous, uncredited, etc.

Originally compiled by Marc Kupper 14:10, 16 Nov 2006 (CST):

Author | Long | Short | Awards | Comments
Anonymous | ~150 | ~250 | 6 |
Anonyous | 0 | 3 | 0 | Should get merged with Anonymous [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)]
Not Available | 4 | 0 | 0 | This sounds like our "not stated" [Fixed - I checked each of the four titles and for each was able to dig up the author name. Marc Kupper 03:01, 22 Nov 2006 (CST)]
uncredit | 0 | 1 | 0 | Should get merged with uncredited. [Fixed Marc Kupper 02:47, 22 Nov 2006 (CST)]
uncredited | 0 | ~1,000 | 0 |
Unknown | ~1,050 | ~550 | 172 | Used for many cover/interior art instead of "unsigned"
unknown | ~1,050 | ~550 | 172 | Both "unknown" and "Unknown" get used but search returned same list
unknownAfghan | 0 | 1 | 0 | No explanation in the story notes about this
unsigned | 45 | 0 | 0 | All 45 long works are for Interior Art
Unknown Unknown | | | | Listed on http://www.isfdb.org/DIR_U.html but not found
Various | 69 | 0 | 0 |
Various Authors | 2 | 0 | 0 | Should get merged into Various [both were bad entries and have been deleted. Ahasuerus]
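
Stray variants like "Anonyous" and "uncredit" in the table above could probably be caught automatically. The following is a minimal sketch in Python (not the Perl used elsewhere on this page), using the standard `difflib` module. The `CANONICAL` list, the choice of lowercase "unknown" as the kept form, the 0.8 cutoff and the function name `suggest_canonical` are all assumptions made for illustration, not project decisions.

```python
import difflib

# Assumption: an illustrative list of placeholder names to keep. The real
# canonical set is a project decision that this page does not specify.
CANONICAL = ["Anonymous", "uncredited", "unknown", "unsigned", "Various"]


def suggest_canonical(name, canonical=CANONICAL, cutoff=0.8):
    """Return the placeholder that `name` probably should merge into, or None."""
    if name in canonical:
        return None  # already a canonical form
    by_lower = {c.lower(): c for c in canonical}
    lowered = name.lower()
    if lowered in by_lower:
        return by_lower[lowered]  # differs only by letter case
    # Otherwise look for a near-miss spelling above the similarity cutoff
    close = difflib.get_close_matches(lowered, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


for name in ["Anonyous", "uncredit", "Unknown", "unsigned"]:
    print(name, "->", suggest_canonical(name))
```

A similarity cutoff will not catch everything: longer variants such as "Various Authors" score too low against "Various" and would still need a manual pass.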

Malformed URLs

Some Author records, e.g. Tad Williams', have bad URLs in the "Web page" field, which were possibly created by the ISFDB1-to-ISFDB2 conversion. We need to find and fix them. Ahasuerus 19:07, 27 Dec 2006 (CST)
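
Finding the bad records could start with a simple screening pass over the "Web page" values. The sketch below is in Python (not the Perl used elsewhere on this page) and relies on the standard `urllib.parse` module. This page does not say what the broken URLs actually look like, so the checks here (embedded whitespace, a missing or non-HTTP scheme, a host without a dot) are guesses. The function name `looks_malformed` and the example.com sample values are invented for illustration.

```python
from urllib.parse import urlsplit


def looks_malformed(url):
    """Heuristic check: True if the value is unlikely to be a usable web URL."""
    # Empty values and embedded whitespace are treated as bad outright
    if not url or any(ch.isspace() for ch in url):
        return True
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some inputs, e.g. unbalanced IPv6 brackets
        return True
    if parts.scheme not in ("http", "https"):
        return True
    # Assumption: a real author page has a dotted host name
    return host is None or "." not in host


# Made-up sample values, not taken from the database
samples = [
    "http://www.example.com/author.html",
    "www.example.com/author.html",
    "http://http://www.example.com/",
]
for url in samples:
    print(url, "->", "suspect" if looks_malformed(url) else "ok")
```

Anything flagged this way would still need a human look, since the right fix depends on how the conversion mangled each value.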