Difference between revisions of "ISFDB:Data Consistency"

From ISFDB
Jump to navigation Jump to search
(→‎Dangling title-publication records: Records cleaned using a script; no more should exist)
Line 160: Line 160:
  
  
 
===Dangling title-publication records===
 
 
Likewise, there are many entries in the pub_content table whose pub_id or title_id is either NULL or refers to a nonexistent record:
 
 
<pre>
 
select * from pub_content pc where not exists (select * from pubs pu where pc.pub_id = pu.pub_id);
 
select * from pub_content pc where not exists (select * from titles ti where pc.title_id = title_id);
 
</pre>
 
[[User:WimLewis|WimLewis]] 16:59, 25 Mar 2007 (CDT)
 
 
Results from backup January 13, 2008:
 
747 rows; 90 rows
 
  
  

Revision as of 01:07, 29 October 2014

The Data Consistency project is a place to coordinate efforts to identify and repair data consistencies, including Stray Publications, malformed ISBNs, etc.

Publication Records

Invalid characters in Publication titles

Malformed ISBNs

As of 2006-12-18, 162 ISBNs failed consistency validation and 73 failed length validation. As of 2007-11-27, the data (excluding magazines, which are handled separately) looked very different:

Valid ISBN - 86910 (83.95%)
Catalog ID - 6608 (6.38%)
No ISBN - 7955 (7.68%)
B000 (from Amazon.com) - 1019 (0.98%)
ISBN is not 10 digits - 836 (0.81%)
Fails checksum validation - 200 (0.19%)

The difference is presumably due to a more granular validation script. The actual Wikified lists of bad records will be posted shortly. Ahasuerus 22:21, 9 Dec 2007 (CST)

ISBN fields that don't start with '#', and aren't 10 digits long" This section of the malformed ISBN field is done, but it should probably be checked again to see what was missed or whats new.Kraang 20:47, 3 Jan 2008 (CST)
Thanks, I'll run the validation script again early next week. I will have to modify it first to allow for 13 digit ISBNs. Ahasuerus 20:56, 3 Jan 2008 (CST)
When you do that can you do the Stray Pubs. again?Kraang 21:04, 3 Jan 2008 (CST)
I think that was Bill's SQL script, the one that he linked below, wasn't it? If so, he can run it any time while I won't be able to run any scripts until next week. Ahasuerus 21:09, 3 Jan 2008 (CST)

Duplicate ISBNs

  • Need a script to find duplicate ISBNs.
This query:
select pub_id, pub_isbn, count(pub_isbn), pub_title from pubs group by pub_isbn having count(pub_isbn) > 1;
returns 3710 rows, although some of them are non-ISBN duplicates such as "None" or "no ISBN".
But since the ISFDB captures multiple printings in multiple publication records, not all duplicate ISBNs are errors. This query finds publications with the same ISBN but different titles:
select pub_id, pub_isbn, count(pub_isbn), pub_title from pubs group by pub_isbn having count(distinct pub_title) > 1;
and returns 657 rows at the moment. Looking at the results of this query, more of them seem to be true duplicates.

Improved SQL excluding magazines, "ISBNs" flagged as not being true iSBNs by a '#' prefix, and trying to remove variants where one version is just a subset of another (e.g. "Title" as opposed to "Title, Series Name" or "Series Name: Title") gives 315 duplicate titles currently.

  select pub_isbn, MIN(pub_title), MAX(pub_title), count(*)
  from pubs
  where pub_isbn IS NOT NULL
  and   pub_isbn NOT LIKE '%Unknown%'
  and   pub_isbn NOT LIKE '%No ISBN%'
  and   pub_isbn NOT LIKE '%N/A%'
  and   pub_isbn NOT LIKE '%None%'
  and   pub_isbn NOT LIKE '#%'
  and   pub_isbn  != ''
  and   pub_ctype != 'MAGAZINE'
  group by pub_isbn
  having count(distinct pub_title) > 1
  AND INSTR(MIN(pub_title), MAX(pub_title)) = 0
  AND INSTR(MAX(pub_title), MIN(pub_title)) = 0;

Many are just minor punctuation differences. The following may be worth a closer look though:

 2253143	The Ancient	Trash
 6470238	The Gap into Ruin: This Day All Gods Die	The Runes of Sorcery
 26151707	Prisoners of Power	Roadside Picnic / Tale of the Troika
 30624363	Labyrinth	The Dark Crystal
 60083611	The Giant Surprise	The Giants, the Mice and the Marsh-Wiggles (Step into Narnia)
 60133821	Nebula Winners Fourteen	The Great Science Fiction Series
 99115913	The Antelope Company at Large	The Secret Visitors Fight Back
 140056513	Heroes and Villains	The Infernal Desire Machines of Doctor Hoffman
 246137762	Starborne	The Wild
 312198930	Greetings, Carbon-Based Bipeds: Collected Works 1934-1998	Greetings, Carbon-Based Bipeds: Collected Works      1944-1998
 312857586	Ender's Shadow	Shadow of the Giant
 330250337	Decade the 1940s	Decade the 1950s
 330375660	Silver Screen	Storming the Bastille
 345300378	A Fine and Private Place	The Last Unicorn
 345439031	The Witch Queen	The Witch's Honour
 370328280	The Spook's Apprentice	The Spook's Secret
 373625375	Deathlands: Demons of Eden	Nightmare Passage
 373720483	Brandyjack	Rebels of Merka
 380007568	Bring the Jubilee	Strange Relations
 380761602	Cold Chills	Cold Shocks
 038513679X	In Joy Still Felt: The Autobiography of Isaac Asimov, 1954-1978	In Memory Yet Green: The Autobiography of Isaac Asimov,  1920-1954
 399124098	The Dancers of Arun	The Northern Girl
 425032116	Callahan's Crosstime Saloon	Farewell to Yesterday's Tomorrow
 425098982	Playmates	The Maddening
 426064453	The Masks of Time	Vornan-19
 439176824	Such a Pretty Face	The Fall
 441006973	Forever Free	The Forever War
 441289037	The Girl from Farris's	The Girl From Hollywood
 441777678	The Multiple Man	The Space Swimmers
 441873464	Galactic Bounty	War World
 445206187	Hot Sky at Midnight	Star of Gypsies
 446607924	Courage of Falcons (The Secret Texts - Book 3)	The End of Magic
 451145747	Of Kings and Quests	Of Quests and Kings
 517551845	Men, Martians and Machines	The Joy Makers
 553229281	Lord Valentine's Castle	Majipoor Chronicles
 553292463	Foundation and Empire	The Trinity Paradox
 553562428	Illusion	Reluctant Voyagers
 583119999	The King of the Swords	The Knight of the Swords
 586043136	The Grain Kings	The Passing of the Dragons
 586049681	Homeworld	Wheelworld
 059303564X	Star Wars: The Truce at Bakura	The Courtship of Princess Leia
 593045580	The Kissed	The Standing Dead
 671015443	Vulcan's Forge	Vulcan's Heart
 671461494	Hitchhiker's Guide to Europe: The Galaxy	The Hitchhiker's Guide to the Galaxy
 671603833	My Friend, the Vampire	The Ordinary Princess
 671653938	The Bug Life Chronicles	Tower to the Sky
 067169801X	The Bug Life Chronicles	Tower to the Sky
 701167181	Angel Times	Merrick: A Novel
 708882633	Madbond	The Burning Realm
 712650504	Firebird	The Venging
 712650512	Firebird	The Venging
 722159048	Dragonflight	Dragonquest
 739407929	Journeys of the Catechist	The Sarantine Mosaic
 747253145	Mirage	Tides
 749710152	After the Plague	Come Lucky April
 812551257	Chaos in Lagrangia	The Lagrangists
 821732684	Stones	Voodoo Fury
 825695716	The Drowned World	When the Sky Burned
 861402030	The Light Fantastic	The Wizards & the Warriors
 884043797	L. Ron Hubbard Presents Writers of the Future, Volume V	Prosthetic Lady
 886773520	Magic's Pawn	Magic's Promise
 886778611	First Rider's Call	The Survivors
 088733010X	Lyonesse II: the Green Pearl	Stitch in Snow
 088733055X	Bitter Ends	Last Rites
 909106118	An Unusual Angle	Dreamworks: Strange New Stories
 937986585	The Sea of the Ravens	Yellow Men Sleep
 940841320	Seeker's Mask	Wizard of the Pigeons
 963094440	Now We Are Sick: An Anthology of Nasty Verse	Temporary Walls
 1557420262	George Alec Effinger	Stanislaw Lem
 1565041585	The Road to Science Fiction 5: The British Way	The Road to Science Fiction 6: Around the World
 158846864X	Arcana: Unearthed Children of the Ruin	Children of the Rune: Tales From the Land of the Diamond Throne
 1593150296	Everquest II (Everquest)	Ocean of Tears
 1596871032	Green Lantern: Sleepers: Book 2	Green Lantern: Sleepers: Book 3
 1841190802	The Mammoth Book of Awesome Comic Fantasy	The Mammoth Book of Historical Whodunnits: Brand New Collection
 184149108X	Maul	Y: A Chromosome
 1844351084	A Life in Pieces	The Purpura Pawn (Professor Bernice Summerfield Series)
 1857231082	The Black Unicorn	Wizard at Large
 1857231848	Treaty at Doona	Treaty Planet
 1857982371	The Black Gryphon	White Gryphon
 1857984382	Diaspora	Prince of Demons
 1880325071	From Hell	The Beast That Shouted Love at the Heart of the World
 1885865082	Doing Time	Erotica Vampirica
 1887666001	Blood Wedding	Vampire Dreams 

BLongley 13:23, 10 Dec 2007 (CST)

Y: A Chromosome could be a working title of Maul. And tp of Maul was published 2003. No info about Y: A Chromosome in Maul? --Roglo 13:48, 12 Jan 2008 (CST)
Not really, although Y Chromosomes are relevant to the plot. This supports the "working title" hypothesis though. BLongley 15:15, 12 Jan 2008 (CST)

Stray Publications

Note: "Stray publications" can happen when: (i) The author(s) of a publication do not match the author(s) of the title to which it is connected; (ii) The Title Type of the publication does not match the Title Type of the title to which it is connected (e.g. publication is NONGENRE while title is NONFICTION); (iii) A title is constructed as a variant title, and the parent title disagrees with the variant in one of these respects.

  • Need a script that finds "novel" Publications whose Author(s) do not match the Author(s) in the associated Title record.
  • ISFDB:Missing Titles - Need a script that finds Publications without an associated Title record.
  • Titles A-Z Fixed(only stray magazines left) (30 Dec 2007) Kraang 18:15, 30 Dec 2007 (CST)
See also here for related SQL scriptiness. BLongley 15:01, 10 Dec 2007 (CST)

Type mismatches

  • Need a script that finds book length Publications whose type doesn't match the type of the associated Title records, e.g. "collection" vs. "novel", "novel" vs. "omnibus", etc.

Missing data

  • Need a script that finds missing data, e.g. missing page counts, missing pb/tp/hc data, 0000-00-00 dates, etc.




Titles

Authorless titles

Two mini-quests for today: titles with authors missing. First, titles that don't match anything in the canonical_author table. I check if they have any pubs (it doesn't depend on having an author):

select tid, title_title, title_ttype, pubc_id, p.pub_id, pub_tag 
    from (select t.title_id tid, t.title_title, t.title_ttype 
             from titles t left outer join canonical_author ca on t.title_id = ca.title_id 
                 where ca.title_id is null) tmp 
         left outer join pub_content pc on pc.title_id = tmp.tid 
         left outer join pubs p on p.pub_id = pc.pub_id;

Results in the May 6, 2008 backup (three more titles than in the February 10, 2008 backup):

+--------+---------------------------------------------------------+--------------+---------+--------+------------+
| tid    | title_title                                             | title_ttype  | pubc_id | pub_id | pub_tag    |
+--------+---------------------------------------------------------+--------------+---------+--------+------------+
| 281671 | The Magazine of Fantasy & Science Fiction, January 2004 | EDITOR       |  350911 | 112701 | FSFJAN2004 | 
| 406811 | Cover: Planets of Wonder                                | COVERART     |    NULL |   NULL | NULL       | 
| 406821 | Cover: A World Out of Time                              | COVERART     |    NULL |   NULL | NULL       | 
| 837920 | Quotation                                               | ESSAY        |    NULL |   NULL | NULL       | 
| 863836 | Waiting                                                 | NULL         |    NULL |   NULL | NULL       | 
| 876893 | Farmer Griggs's Boggart                                 | SHORTFICTION |    NULL |   NULL | NULL       | 
| 876895 | The Last of The Huggermuggers                           | SHORTFICTION |    NULL |   NULL | NULL       | 
+--------+---------------------------------------------------------+--------------+---------+--------+------------+
7 rows in set (4.64 sec)

I have no idea what the first record is; the other authorless and publess titles should be safe to delete. All generate python errors while trying to match reviews by author and title. Anyway, links:

And now titles which have their id in the canonical_author table, but the author_id from the canonical_author record is not an id of any existing author. This is more complicated, because you can have multiple canonical_author records for any title, and some of them may point to nonexistent authors. So I try to find titles where one of the authors doesn't exist, but display all the authors.

select t.title_id, t.title_title, t.title_ttype, ca.author_id as nonexistent, ca2.author_id, a2.author_canonical 
    from titles t, 
         (canonical_author ca left outer join authors a on ca.author_id = a.author_id), 
         (canonical_author ca2 left outer join authors a2 on ca2.author_id = a2.author_id) 
    where t.title_id = ca.title_id and t.title_id = ca2.title_id and a.author_id is null ;

Results in the January 6, 2008 backup:

+----------+-----------------------------+--------------+-------------+-----------+--------------------+
| title_id | title_title                 | title_ttype  | nonexistent | author_id | author_canonical   |
+----------+-----------------------------+--------------+-------------+-----------+--------------------+
|   190423 | Hunter, Come Home           | SHORTFICTION |       11231 |     11231 | NULL               | 
|   190423 | Hunter, Come Home           | SHORTFICTION |       11231 |     53491 | Richard M. McKenna | 
|   192212 | The Man the Worlds Rejected | SHORTFICTION |       19035 |     19035 | NULL               | 
|   192212 | The Man the Worlds Rejected | SHORTFICTION |       19035 |     68421 | Gordon Dickson     | 
|   198701 | Strictly Confidential       | SHORTFICTION |       19035 |     19035 | NULL               | 
|   431031 | Spacepaw                    | NOVEL        |       19035 |     68421 | Gordon Dickson     | 
|   431031 | Spacepaw                    | NOVEL        |       19035 |     19035 | NULL               | 
+----------+-----------------------------+--------------+-------------+-----------+--------------------+
7 rows in set (5.02 sec)

We can't really clean the canonical_author table (adding and removing authors in the title editor doesn't help) but Strictly Confidential generates python error and we can fix that by adding at least one existing author (looks like it is Gordon Dickson again?). Links to all 4 titles:

--Roglo 10:46, 11 Jan 2008 (CST)

"Strictly Confidential" looks interesting: you can narrow that down to the pub:
 select * from pub_content pc
 where pc.title_id = 198701
 pubc_id	title_id	pub_id	pubc_page
 251922	198701		89769	112
And it's here, verified, and so theoretically the verifier could tell us if it's Gordon Dickson or Gordon R. Dickson. BLongley 12:47, 11 Jan 2008 (CST)
"Strictly Confidential" fixed. Attributed to "Gordon Dickson", so the pub record was edited, then a variant was created for "Gordon R. Dickson", and then merged with the existing title record. Mhhutchins 23:33, 12 Jan 2008 (CST)

Safe to auto-merge identical titles?

Related to the Gardner Dozois collection was that for nearly all of the titles I found had two title records that were identical other than one was a parent of the other. If it seems safe it seems it would safe some work to to do a sweep for title records that are identical and to auto-merge them. Marc Kupper 22:25, 22 Dec 2006 (CST)


Tags

It would be nice to standardize user-generated tags.

Synonyms


  • hard science fiction
  • hard sf

  • history of sf
  • history-of-sf

  • juvenile sf
  • juvenile-sf

  • recursive sf
  • meta sf

  • 'young-adult humorous sf' -> 'humorous sf' + 'young-adult sf'

etc etc

some one should probably come up with a standard then make some filters for input and run the filters on the existing tags.


Authors

Authors who are (probably) doubled

I don't know how we can browse authors summary pages by ids. Advanced search gives access only to editing authors.

These have the same canonical names and their legal names are either equal or at least one legal name is NULL (so they have to be checked before merging):

select a1.author_canonical canonical, a1.author_legalname legal, count(1) 
   from authors a1, authors a2 
   where a1.author_canonical = a2.author_canonical and a1.author_id <> a2.author_id 
     and (a1.author_legalname = a2.author_legalname or IsNull(a1.author_legalname) or IsNull(a2.author_legalname)) 
   group by 1, 2;

Results from backup January 27, 2008:

+-------------+--------------+----------+
| canonical   | legal        | count(1) |
+-------------+--------------+----------+
| Philip Kent | NULL         |        2 | 
| Simon Clark | NULL         |        1 | 
| Simon Clark | Clark, Simon |        1 | 
+-------------+--------------+----------+
3 rows in set (1.70 sec)

Looks like one Simon Clark is new. How do these get created? --Roglo 13:15, 28 Jan 2008 (CST)

You can't do it by changing author names in Titles and Publications since the software will assume that you want to use the existing author record instead of creating a new one. However, when you change an Author's name in "Author Data", the software doesn't check whether there is another author record with the same name. Thus, if we have "Simon Clark" and "Simon Clarke" on file and the latter is changed to "Simon Clark", we will end up with two "Simon Clark" records. When reviewing submissions that attempt to modify Author records, moderators have no way of telling whether another Author record with the same name already exists, so they often slip through. Ahasuerus 15:00, 28 Jan 2008 (CST)
P.S. I have deleted the results from the last two passes, but they are available via this page's History if anybody needs them. Ahasuerus 15:00, 28 Jan 2008 (CST)

2008-02-10 results:

+-------------+--------------+----------+
| canonical   | legal        | count(1) |
+-------------+--------------+----------+
| Simon Clark | NULL         |        1 |
| Simon Clark | Clark, Simon |        1 |
+-------------+--------------+----------+
Fixed on 2008-02-29. Ahasuerus 23:57, 29 Feb 2008 (CST)

Pseudonym consistency

There are 22 authors in the latest dump who are self-pseudonyms (they have a pseudonym record pointing back to themselves):

select ps.pseudo_id, ps.author_id, au.author_canonical
    from pseudonyms ps, authors au
    where ps.author_id = ps.pseudonym and ps.author_id = au.author_id;


There are two pairs of authors who are pseudonyms of each other:

  • Avram Davidson (author_id 501) and Ellery Queen (author_id 4921), twice
  • Douglas Stapleton (author_id 12921) and Doug Stapleton (author_id 12922)

There may also be longer cycles... WimLewis 22:28, 25 Mar 2007 (CDT)



Title vs. Publication Type Consistency

This section documents mismatches between Title types and associated Publication types:

Pseudonyms in Collections

The following page lists all known Collection Publications that include pseudonymous Titles as of the August 11, 2007 backup. Although this is not always indicative of an error, we estimate that a significant percentage of these occurrences need to be fixed.

Serial Dates

The following page lists all known Serial records whose Title dates do not match the dates of the Publications that they appeared in as of the 2008-02-10 backup.