Difference between revisions of "ISFDB:Data Consistency"
(→Titles which are variants of non-existent titles: Removin section - moved to cleanup reports) |
(→Series: Coverted to a cleanup report) |
||
Line 291: | Line 291: | ||
* [[ISFDB:Repair_Awards]] Need a script that locates bad/missing mapping records between the awards and titles tables. | * [[ISFDB:Repair_Awards]] Need a script that locates bad/missing mapping records between the awards and titles tables. | ||
− | |||
− | |||
== Authors == | == Authors == |
Revision as of 17:10, 28 October 2014
The Data Consistency project is a place to coordinate efforts to identify and repair data consistencies, including Stray Publications, malformed ISBNs, etc.
Publication Records
Invalid characters in Publication titles
- ISFDB:Invalid characters in Publication titles 14 as of 2007-11-27. Ahasuerus 19:36, 9 Dec 2007 (CST)
Malformed ISBNs
- ISFDB:Bad ISBN List - Need a script to find ISBNs that fail ISBN consistency validation.
As of 2006-12-18, 162 ISBNs failed consistency validation and 73 failed length validation. As of 2007-11-27, the data (excluding magazines, which are handled separately) looked very different:
Valid ISBN - 86910 (83.95%) Catalog ID - 6608 (6.38%) No ISBN - 7955 (7.68%) B000 (from Amazon.com) - 1019 (0.98%) ISBN is not 10 digits - 836 (0.81%) Fails checksum validation - 200 (0.19%)
The difference is presumably due to a more granular validation script. The actual Wikified lists of bad records will be posted shortly. Ahasuerus 22:21, 9 Dec 2007 (CST)
- ISBN fields that don't start with '#', and aren't 10 digits long" This section of the malformed ISBN field is done, but it should probably be checked again to see what was missed or whats new.Kraang 20:47, 3 Jan 2008 (CST)
- Thanks, I'll run the validation script again early next week. I will have to modify it first to allow for 13 digit ISBNs. Ahasuerus 20:56, 3 Jan 2008 (CST)
- When you do that can you do the Stray Pubs. again?Kraang 21:04, 3 Jan 2008 (CST)
- I think that was Bill's SQL script, the one that he linked below, wasn't it? If so, he can run it any time while I won't be able to run any scripts until next week. Ahasuerus 21:09, 3 Jan 2008 (CST)
Duplicate ISBNs
- Need a script to find duplicate ISBNs.
- This query:
select pub_id, pub_isbn, count(pub_isbn), pub_title from pubs group by pub_isbn having count(pub_isbn) > 1;
returns 3710 rows, although some of them are non-ISBN duplicates such as "None" or "no ISBN". - But since the ISFDB captures multiple printings in multiple publication records, not all duplicate ISBNs are errors. This query finds publications with the same ISBN but different titles:
select pub_id, pub_isbn, count(pub_isbn), pub_title from pubs group by pub_isbn having count(distinct pub_title) > 1;
and returns 657 rows at the moment. Looking at the results of this query, more of them seem to be true duplicates.
- This query:
Improved SQL excluding magazines, "ISBNs" flagged as not being true iSBNs by a '#' prefix, and trying to remove variants where one version is just a subset of another (e.g. "Title" as opposed to "Title, Series Name" or "Series Name: Title") gives 315 duplicate titles currently.
select pub_isbn, MIN(pub_title), MAX(pub_title), count(*) from pubs where pub_isbn IS NOT NULL and pub_isbn NOT LIKE '%Unknown%' and pub_isbn NOT LIKE '%No ISBN%' and pub_isbn NOT LIKE '%N/A%' and pub_isbn NOT LIKE '%None%' and pub_isbn NOT LIKE '#%' and pub_isbn != '' and pub_ctype != 'MAGAZINE' group by pub_isbn having count(distinct pub_title) > 1 AND INSTR(MIN(pub_title), MAX(pub_title)) = 0 AND INSTR(MAX(pub_title), MIN(pub_title)) = 0;
Many are just minor punctuation differences. The following may be worth a closer look though:
2253143 The Ancient Trash 6470238 The Gap into Ruin: This Day All Gods Die The Runes of Sorcery 26151707 Prisoners of Power Roadside Picnic / Tale of the Troika 30624363 Labyrinth The Dark Crystal 60083611 The Giant Surprise The Giants, the Mice and the Marsh-Wiggles (Step into Narnia) 60133821 Nebula Winners Fourteen The Great Science Fiction Series 99115913 The Antelope Company at Large The Secret Visitors Fight Back 140056513 Heroes and Villains The Infernal Desire Machines of Doctor Hoffman 246137762 Starborne The Wild 312198930 Greetings, Carbon-Based Bipeds: Collected Works 1934-1998 Greetings, Carbon-Based Bipeds: Collected Works 1944-1998 312857586 Ender's Shadow Shadow of the Giant 330250337 Decade the 1940s Decade the 1950s 330375660 Silver Screen Storming the Bastille 345300378 A Fine and Private Place The Last Unicorn 345439031 The Witch Queen The Witch's Honour 370328280 The Spook's Apprentice The Spook's Secret 373625375 Deathlands: Demons of Eden Nightmare Passage 373720483 Brandyjack Rebels of Merka 380007568 Bring the Jubilee Strange Relations 380761602 Cold Chills Cold Shocks 038513679X In Joy Still Felt: The Autobiography of Isaac Asimov, 1954-1978 In Memory Yet Green: The Autobiography of Isaac Asimov, 1920-1954 399124098 The Dancers of Arun The Northern Girl 425032116 Callahan's Crosstime Saloon Farewell to Yesterday's Tomorrow 425098982 Playmates The Maddening 426064453 The Masks of Time Vornan-19 439176824 Such a Pretty Face The Fall 441006973 Forever Free The Forever War 441289037 The Girl from Farris's The Girl From Hollywood 441777678 The Multiple Man The Space Swimmers 441873464 Galactic Bounty War World 445206187 Hot Sky at Midnight Star of Gypsies 446607924 Courage of Falcons (The Secret Texts - Book 3) The End of Magic 451145747 Of Kings and Quests Of Quests and Kings 517551845 Men, Martians and Machines The Joy Makers 553229281 Lord Valentine's Castle Majipoor Chronicles 553292463 Foundation and Empire The Trinity Paradox 553562428 Illusion Reluctant Voyagers 583119999 The King of the Swords The Knight of the Swords 586043136 The Grain Kings The Passing of the Dragons 586049681 Homeworld Wheelworld 059303564X Star Wars: The Truce at Bakura The Courtship of Princess Leia 593045580 The Kissed The Standing Dead 671015443 Vulcan's Forge Vulcan's Heart 671461494 Hitchhiker's Guide to Europe: The Galaxy The Hitchhiker's Guide to the Galaxy 671603833 My Friend, the Vampire The Ordinary Princess 671653938 The Bug Life Chronicles Tower to the Sky 067169801X The Bug Life Chronicles Tower to the Sky 701167181 Angel Times Merrick: A Novel 708882633 Madbond The Burning Realm 712650504 Firebird The Venging 712650512 Firebird The Venging 722159048 Dragonflight Dragonquest 739407929 Journeys of the Catechist The Sarantine Mosaic 747253145 Mirage Tides 749710152 After the Plague Come Lucky April 812551257 Chaos in Lagrangia The Lagrangists 821732684 Stones Voodoo Fury 825695716 The Drowned World When the Sky Burned 861402030 The Light Fantastic The Wizards & the Warriors 884043797 L. Ron Hubbard Presents Writers of the Future, Volume V Prosthetic Lady 886773520 Magic's Pawn Magic's Promise 886778611 First Rider's Call The Survivors 088733010X Lyonesse II: the Green Pearl Stitch in Snow 088733055X Bitter Ends Last Rites 909106118 An Unusual Angle Dreamworks: Strange New Stories 937986585 The Sea of the Ravens Yellow Men Sleep 940841320 Seeker's Mask Wizard of the Pigeons 963094440 Now We Are Sick: An Anthology of Nasty Verse Temporary Walls 1557420262 George Alec Effinger Stanislaw Lem 1565041585 The Road to Science Fiction 5: The British Way The Road to Science Fiction 6: Around the World 158846864X Arcana: Unearthed Children of the Ruin Children of the Rune: Tales From the Land of the Diamond Throne 1593150296 Everquest II (Everquest) Ocean of Tears 1596871032 Green Lantern: Sleepers: Book 2 Green Lantern: Sleepers: Book 3 1841190802 The Mammoth Book of Awesome Comic Fantasy The Mammoth Book of Historical Whodunnits: Brand New Collection 184149108X Maul Y: A Chromosome 1844351084 A Life in Pieces The Purpura Pawn (Professor Bernice Summerfield Series) 1857231082 The Black Unicorn Wizard at Large 1857231848 Treaty at Doona Treaty Planet 1857982371 The Black Gryphon White Gryphon 1857984382 Diaspora Prince of Demons 1880325071 From Hell The Beast That Shouted Love at the Heart of the World 1885865082 Doing Time Erotica Vampirica 1887666001 Blood Wedding Vampire Dreams
BLongley 13:23, 10 Dec 2007 (CST)
- Y: A Chromosome could be a working title of Maul. And tp of Maul was published 2003. No info about Y: A Chromosome in Maul? --Roglo 13:48, 12 Jan 2008 (CST)
- Not really, although Y Chromosomes are relevant to the plot. This supports the "working title" hypothesis though. BLongley 15:15, 12 Jan 2008 (CST)
Stray Publications
Note: "Stray publications" can happen when: (i) The author(s) of a publication do not match the author(s) of the title to which it is connected; (ii) The Title Type of the publication does not match the Title Type of the title to which it is connected (e.g. publication is NONGENRE while title is NONFICTION); (iii) A title is constructed as a variant title, and the parent title disagrees with the variant in one of these respects.
- Need a script that finds "novel" Publications whose Author(s) do not match the Author(s) in the associated Title record.
- ISFDB:Missing Titles - Need a script that finds Publications without an associated Title record.
- Titles A-Z Fixed(only stray magazines left) (30 Dec 2007) Kraang 18:15, 30 Dec 2007 (CST)
Type mismatches
- Need a script that finds book length Publications whose type doesn't match the type of the associated Title records, e.g. "collection" vs. "novel", "novel" vs. "omnibus", etc.
Missing data
- Need a script that finds missing data, e.g. missing page counts, missing pb/tp/hc data, 0000-00-00 dates, etc.
Dangling title-publication records
Likewise, there are many entries in the pub_content table whose pub_id or title_id is either NULL or refers to a nonexistent record:
select * from pub_content pc where not exists (select * from pubs pu where pc.pub_id = pu.pub_id); select * from pub_content pc where not exists (select * from titles ti where pc.title_id = title_id);
WimLewis 16:59, 25 Mar 2007 (CDT)
Results from backup January 13, 2008:
747 rows; 90 rows
Titles
Authorless titles
Two mini-quests for today: titles with authors missing. First, titles that don't match anything in the canonical_author table. I check if they have any pubs (it doesn't depend on having an author):
select tid, title_title, title_ttype, pubc_id, p.pub_id, pub_tag from (select t.title_id tid, t.title_title, t.title_ttype from titles t left outer join canonical_author ca on t.title_id = ca.title_id where ca.title_id is null) tmp left outer join pub_content pc on pc.title_id = tmp.tid left outer join pubs p on p.pub_id = pc.pub_id;
Results in the May 6, 2008 backup (three more titles than in the February 10, 2008 backup):
+--------+---------------------------------------------------------+--------------+---------+--------+------------+ | tid | title_title | title_ttype | pubc_id | pub_id | pub_tag | +--------+---------------------------------------------------------+--------------+---------+--------+------------+ | 281671 | The Magazine of Fantasy & Science Fiction, January 2004 | EDITOR | 350911 | 112701 | FSFJAN2004 | | 406811 | Cover: Planets of Wonder | COVERART | NULL | NULL | NULL | | 406821 | Cover: A World Out of Time | COVERART | NULL | NULL | NULL | | 837920 | Quotation | ESSAY | NULL | NULL | NULL | | 863836 | Waiting | NULL | NULL | NULL | NULL | | 876893 | Farmer Griggs's Boggart | SHORTFICTION | NULL | NULL | NULL | | 876895 | The Last of The Huggermuggers | SHORTFICTION | NULL | NULL | NULL | +--------+---------------------------------------------------------+--------------+---------+--------+------------+ 7 rows in set (4.64 sec)
I have no idea what the first record is; the other authorless and publess titles should be safe to delete. All generate python errors while trying to match reviews by author and title. Anyway, links:
- The Magazine of Fantasy & Science Fiction, January 2004 EDITOR FSFJAN2004 Pub
- Cover: Planets of Wonder COVERART
- Cover: A World Out of Time COVERART
- Quotation ESSAY
- Waiting
- Farmer Griggs's Boggart SHORTFICTION
- The Last of The Huggermuggers SHORTFICTION
And now titles which have their id in the canonical_author table, but the author_id from the canonical_author record is not an id of any existing author. This is more complicated, because you can have multiple canonical_author records for any title, and some of them may point to nonexistent authors. So I try to find titles where one of the authors doesn't exist, but display all the authors.
select t.title_id, t.title_title, t.title_ttype, ca.author_id as nonexistent, ca2.author_id, a2.author_canonical from titles t, (canonical_author ca left outer join authors a on ca.author_id = a.author_id), (canonical_author ca2 left outer join authors a2 on ca2.author_id = a2.author_id) where t.title_id = ca.title_id and t.title_id = ca2.title_id and a.author_id is null ;
Results in the January 6, 2008 backup:
+----------+-----------------------------+--------------+-------------+-----------+--------------------+ | title_id | title_title | title_ttype | nonexistent | author_id | author_canonical | +----------+-----------------------------+--------------+-------------+-----------+--------------------+ | 190423 | Hunter, Come Home | SHORTFICTION | 11231 | 11231 | NULL | | 190423 | Hunter, Come Home | SHORTFICTION | 11231 | 53491 | Richard M. McKenna | | 192212 | The Man the Worlds Rejected | SHORTFICTION | 19035 | 19035 | NULL | | 192212 | The Man the Worlds Rejected | SHORTFICTION | 19035 | 68421 | Gordon Dickson | | 198701 | Strictly Confidential | SHORTFICTION | 19035 | 19035 | NULL | | 431031 | Spacepaw | NOVEL | 19035 | 68421 | Gordon Dickson | | 431031 | Spacepaw | NOVEL | 19035 | 19035 | NULL | +----------+-----------------------------+--------------+-------------+-----------+--------------------+ 7 rows in set (5.02 sec)
We can't really clean the canonical_author table (adding and removing authors in the title editor doesn't help) but Strictly Confidential generates python error and we can fix that by adding at least one existing author (looks like it is Gordon Dickson again?). Links to all 4 titles:
- Hunter, Come Home SHORTFICTION by Richard M. McKenna and NULL
- The Man the Worlds Rejected SHORTFICTION by Gordon Dickson and NULL
- Strictly Confidential SHORTFICTION by NULL (the one interesting)
- Spacepaw NOVEL by Gordon Dickson and NULL
--Roglo 10:46, 11 Jan 2008 (CST)
- "Strictly Confidential" looks interesting: you can narrow that down to the pub:
select * from pub_content pc where pc.title_id = 198701
pubc_id title_id pub_id pubc_page 251922 198701 89769 112
- And it's here, verified, and so theoretically the verifier could tell us if it's Gordon Dickson or Gordon R. Dickson. BLongley 12:47, 11 Jan 2008 (CST)
- "Strictly Confidential" fixed. Attributed to "Gordon Dickson", so the pub record was edited, then a variant was created for "Gordon R. Dickson", and then merged with the existing title record. Mhhutchins 23:33, 12 Jan 2008 (CST)
Safe to auto-merge identical titles?
Related to the Gardner Dozois collection was that for nearly all of the titles I found had two title records that were identical other than one was a parent of the other. If it seems safe it seems it would safe some work to to do a sweep for title records that are identical and to auto-merge them. Marc Kupper 22:25, 22 Dec 2006 (CST)
Tags
It would be nice to standardize user-generated tags.
Synonyms
- hard science fiction
- hard sf
- history of sf
- history-of-sf
- juvenile sf
- juvenile-sf
- recursive sf
- meta sf
- 'young-adult humorous sf' -> 'humorous sf' + 'young-adult sf'
etc etc
some one should probably come up with a standard then make some filters for input and run the filters on the existing tags.
Awards
- ISFDB:Repair_Awards Need a script that locates bad/missing mapping records between the awards and titles tables.
Authors
Authors in the Author Directory that don't appear to have written anything
ISFDB:Authors that only exist due to reviews I've cleared up most of the "A"s, either by correcting/regularising the name in the review, or adding the missing publication or variant. The above page I created has links to each review causing a problem - many authors can be cleared up with ONE review fix, so have away at them! BLongley 13:54, 29 Dec 2007 (CST)
Authors who are (probably) doubled
I don't know how we can browse authors summary pages by ids. Advanced search gives access only to editing authors.
These have the same canonical names and their legal names are either equal or at least one legal name is NULL (so they have to be checked before merging):
select a1.author_canonical canonical, a1.author_legalname legal, count(1) from authors a1, authors a2 where a1.author_canonical = a2.author_canonical and a1.author_id <> a2.author_id and (a1.author_legalname = a2.author_legalname or IsNull(a1.author_legalname) or IsNull(a2.author_legalname)) group by 1, 2;
Results from backup January 27, 2008:
+-------------+--------------+----------+ | canonical | legal | count(1) | +-------------+--------------+----------+ | Philip Kent | NULL | 2 | | Simon Clark | NULL | 1 | | Simon Clark | Clark, Simon | 1 | +-------------+--------------+----------+ 3 rows in set (1.70 sec)
Looks like one Simon Clark is new. How do these get created? --Roglo 13:15, 28 Jan 2008 (CST)
- You can't do it by changing author names in Titles and Publications since the software will assume that you want to use the existing author record instead of creating a new one. However, when you change an Author's name in "Author Data", the software doesn't check whether there is another author record with the same name. Thus, if we have "Simon Clark" and "Simon Clarke" on file and the latter is changed to "Simon Clark", we will end up with two "Simon Clark" records. When reviewing submissions that attempt to modify Author records, moderators have no way of telling whether another Author record with the same name already exists, so they often slip through. Ahasuerus 15:00, 28 Jan 2008 (CST)
- P.S. I have deleted the results from the last two passes, but they are available via this page's History if anybody needs them. Ahasuerus 15:00, 28 Jan 2008 (CST)
2008-02-10 results:
+-------------+--------------+----------+ | canonical | legal | count(1) | +-------------+--------------+----------+ | Simon Clark | NULL | 1 | | Simon Clark | Clark, Simon | 1 | +-------------+--------------+----------+
- Fixed on 2008-02-29. Ahasuerus 23:57, 29 Feb 2008 (CST)
Pseudonym consistency
There are 22 authors in the latest dump who are self-pseudonyms (they have a pseudonym record pointing back to themselves):
select ps.pseudo_id, ps.author_id, au.author_canonical from pseudonyms ps, authors au where ps.author_id = ps.pseudonym and ps.author_id = au.author_id;
There are two pairs of authors who are pseudonyms of each other:
- Avram Davidson (author_id 501) and Ellery Queen (author_id 4921), twice
- Douglas Stapleton (author_id 12921) and Doug Stapleton (author_id 12922)
There may also be longer cycles... WimLewis 22:28, 25 Mar 2007 (CDT)
Title vs. Publication Type Consistency
This section documents mismatches between Title types and associated Publication types:
- Collections:
- Novels:
- Non-fiction:
- Other:
- Non-genre mismatches
- Serial mismatches
- Omnibus mismatches
- Anthology mismatches
- Editor mismatches
Pseudonyms in Collections
The following page lists all known Collection Publications that include pseudonymous Titles as of the August 11, 2007 backup. Although this is not always indicative of an error, we estimate that a significant percentage of these occurrences need to be fixed.
- Pseudonyms in Collections NEW as of 2008-05-06
Serial Dates
The following page lists all known Serial records whose Title dates do not match the dates of the Publications that they appeared in as of the 2008-02-10 backup.