Difference between revisions of "ISFDB:Data Consistency"

From ISFDB
Jump to navigation Jump to search
(→‎Duplicate ISBNs: Removing - superseded by a cleanup report)
(→‎Stray Publications: Replaced with a cleanup script; deleting)
Line 10: Line 10:
  
  
===Stray Publications===
 
Note: "Stray publications" can happen when: (i) The author(s) of a publication do not match the author(s) of the title to which it is connected; (ii) The Title Type of the publication does not match the Title Type of the title to which it is connected (e.g. publication is NONGENRE while title is NONFICTION); (iii) A title is constructed as a variant title, and the parent title disagrees with the variant in one of these respects.
 
  
*Need a script that finds "novel" Publications whose Author(s) do not match the Author(s) in the associated Title record.
 
* [[ISFDB:Missing Titles]] - Need a script that finds Publications without an associated Title record.
 
*Titles A-Z Fixed(only stray magazines left) (30 Dec 2007) [[User:Kraang|Kraang]] 18:15, 30 Dec 2007 (CST)
 
: See also [http://www.isfdb.org/wiki/index.php/Talk:Database_Schema#Let.27s_try_it_from_the_other_side here] for related SQL scriptiness. [[User:BLongley|BLongley]] 15:01, 10 Dec 2007 (CST)
 
  
 
===Type mismatches===
 
===Type mismatches===

Revision as of 19:16, 19 February 2015

The Data Consistency project is a place to coordinate efforts to identify and repair data consistencies, including Stray Publications, malformed ISBNs, etc.

Publication Records

Invalid characters in Publication titles




Type mismatches

  • Need a script that finds book length Publications whose type doesn't match the type of the associated Title records, e.g. "collection" vs. "novel", "novel" vs. "omnibus", etc.

Missing data

  • Need a script that finds missing data, e.g. missing page counts, missing pb/tp/hc data, 0000-00-00 dates, etc.




Titles

Authorless titles

Two mini-quests for today: titles with authors missing. First, titles that don't match anything in the canonical_author table. I check if they have any pubs (it doesn't depend on having an author):

select tid, title_title, title_ttype, pubc_id, p.pub_id, pub_tag 
    from (select t.title_id tid, t.title_title, t.title_ttype 
             from titles t left outer join canonical_author ca on t.title_id = ca.title_id 
                 where ca.title_id is null) tmp 
         left outer join pub_content pc on pc.title_id = tmp.tid 
         left outer join pubs p on p.pub_id = pc.pub_id;

Results in the May 6, 2008 backup (three more titles than in the February 10, 2008 backup):

+--------+---------------------------------------------------------+--------------+---------+--------+------------+
| tid    | title_title                                             | title_ttype  | pubc_id | pub_id | pub_tag    |
+--------+---------------------------------------------------------+--------------+---------+--------+------------+
| 281671 | The Magazine of Fantasy & Science Fiction, January 2004 | EDITOR       |  350911 | 112701 | FSFJAN2004 | 
| 406811 | Cover: Planets of Wonder                                | COVERART     |    NULL |   NULL | NULL       | 
| 406821 | Cover: A World Out of Time                              | COVERART     |    NULL |   NULL | NULL       | 
| 837920 | Quotation                                               | ESSAY        |    NULL |   NULL | NULL       | 
| 863836 | Waiting                                                 | NULL         |    NULL |   NULL | NULL       | 
| 876893 | Farmer Griggs's Boggart                                 | SHORTFICTION |    NULL |   NULL | NULL       | 
| 876895 | The Last of The Huggermuggers                           | SHORTFICTION |    NULL |   NULL | NULL       | 
+--------+---------------------------------------------------------+--------------+---------+--------+------------+
7 rows in set (4.64 sec)

I have no idea what the first record is; the other authorless and publess titles should be safe to delete. All generate python errors while trying to match reviews by author and title. Anyway, links:

And now titles which have their id in the canonical_author table, but the author_id from the canonical_author record is not an id of any existing author. This is more complicated, because you can have multiple canonical_author records for any title, and some of them may point to nonexistent authors. So I try to find titles where one of the authors doesn't exist, but display all the authors.

select t.title_id, t.title_title, t.title_ttype, ca.author_id as nonexistent, ca2.author_id, a2.author_canonical 
    from titles t, 
         (canonical_author ca left outer join authors a on ca.author_id = a.author_id), 
         (canonical_author ca2 left outer join authors a2 on ca2.author_id = a2.author_id) 
    where t.title_id = ca.title_id and t.title_id = ca2.title_id and a.author_id is null ;

Results in the January 6, 2008 backup:

+----------+-----------------------------+--------------+-------------+-----------+--------------------+
| title_id | title_title                 | title_ttype  | nonexistent | author_id | author_canonical   |
+----------+-----------------------------+--------------+-------------+-----------+--------------------+
|   190423 | Hunter, Come Home           | SHORTFICTION |       11231 |     11231 | NULL               | 
|   190423 | Hunter, Come Home           | SHORTFICTION |       11231 |     53491 | Richard M. McKenna | 
|   192212 | The Man the Worlds Rejected | SHORTFICTION |       19035 |     19035 | NULL               | 
|   192212 | The Man the Worlds Rejected | SHORTFICTION |       19035 |     68421 | Gordon Dickson     | 
|   198701 | Strictly Confidential       | SHORTFICTION |       19035 |     19035 | NULL               | 
|   431031 | Spacepaw                    | NOVEL        |       19035 |     68421 | Gordon Dickson     | 
|   431031 | Spacepaw                    | NOVEL        |       19035 |     19035 | NULL               | 
+----------+-----------------------------+--------------+-------------+-----------+--------------------+
7 rows in set (5.02 sec)

We can't really clean the canonical_author table (adding and removing authors in the title editor doesn't help) but Strictly Confidential generates python error and we can fix that by adding at least one existing author (looks like it is Gordon Dickson again?). Links to all 4 titles:

--Roglo 10:46, 11 Jan 2008 (CST)

"Strictly Confidential" looks interesting: you can narrow that down to the pub:
 select * from pub_content pc
 where pc.title_id = 198701
 pubc_id	title_id	pub_id	pubc_page
 251922	198701		89769	112
And it's here, verified, and so theoretically the verifier could tell us if it's Gordon Dickson or Gordon R. Dickson. BLongley 12:47, 11 Jan 2008 (CST)
"Strictly Confidential" fixed. Attributed to "Gordon Dickson", so the pub record was edited, then a variant was created for "Gordon R. Dickson", and then merged with the existing title record. Mhhutchins 23:33, 12 Jan 2008 (CST)

Safe to auto-merge identical titles?

Related to the Gardner Dozois collection was that for nearly all of the titles I found had two title records that were identical other than one was a parent of the other. If it seems safe it seems it would safe some work to to do a sweep for title records that are identical and to auto-merge them. Marc Kupper 22:25, 22 Dec 2006 (CST)


Tags

It would be nice to standardize user-generated tags.

Synonyms


  • hard science fiction
  • hard sf

  • history of sf
  • history-of-sf

  • juvenile sf
  • juvenile-sf

  • recursive sf
  • meta sf

  • 'young-adult humorous sf' -> 'humorous sf' + 'young-adult sf'

etc etc

some one should probably come up with a standard then make some filters for input and run the filters on the existing tags.


Authors

Authors who are (probably) doubled

I don't know how we can browse authors summary pages by ids. Advanced search gives access only to editing authors.

These have the same canonical names and their legal names are either equal or at least one legal name is NULL (so they have to be checked before merging):

select a1.author_canonical canonical, a1.author_legalname legal, count(1) 
   from authors a1, authors a2 
   where a1.author_canonical = a2.author_canonical and a1.author_id <> a2.author_id 
     and (a1.author_legalname = a2.author_legalname or IsNull(a1.author_legalname) or IsNull(a2.author_legalname)) 
   group by 1, 2;

Results from backup January 27, 2008:

+-------------+--------------+----------+
| canonical   | legal        | count(1) |
+-------------+--------------+----------+
| Philip Kent | NULL         |        2 | 
| Simon Clark | NULL         |        1 | 
| Simon Clark | Clark, Simon |        1 | 
+-------------+--------------+----------+
3 rows in set (1.70 sec)

Looks like one Simon Clark is new. How do these get created? --Roglo 13:15, 28 Jan 2008 (CST)

You can't do it by changing author names in Titles and Publications since the software will assume that you want to use the existing author record instead of creating a new one. However, when you change an Author's name in "Author Data", the software doesn't check whether there is another author record with the same name. Thus, if we have "Simon Clark" and "Simon Clarke" on file and the latter is changed to "Simon Clark", we will end up with two "Simon Clark" records. When reviewing submissions that attempt to modify Author records, moderators have no way of telling whether another Author record with the same name already exists, so they often slip through. Ahasuerus 15:00, 28 Jan 2008 (CST)
P.S. I have deleted the results from the last two passes, but they are available via this page's History if anybody needs them. Ahasuerus 15:00, 28 Jan 2008 (CST)

2008-02-10 results:

+-------------+--------------+----------+
| canonical   | legal        | count(1) |
+-------------+--------------+----------+
| Simon Clark | NULL         |        1 |
| Simon Clark | Clark, Simon |        1 |
+-------------+--------------+----------+
Fixed on 2008-02-29. Ahasuerus 23:57, 29 Feb 2008 (CST)

Pseudonym consistency

There are 22 authors in the latest dump who are self-pseudonyms (they have a pseudonym record pointing back to themselves):

select ps.pseudo_id, ps.author_id, au.author_canonical
    from pseudonyms ps, authors au
    where ps.author_id = ps.pseudonym and ps.author_id = au.author_id;


There are two pairs of authors who are pseudonyms of each other:

  • Avram Davidson (author_id 501) and Ellery Queen (author_id 4921), twice
  • Douglas Stapleton (author_id 12921) and Doug Stapleton (author_id 12922)

There may also be longer cycles... WimLewis 22:28, 25 Mar 2007 (CDT)



Title vs. Publication Type Consistency

This section documents mismatches between Title types and associated Publication types:

Pseudonyms in Collections

The following page lists all known Collection Publications that include pseudonymous Titles as of the August 11, 2007 backup. Although this is not always indicative of an error, we estimate that a significant percentage of these occurrences need to be fixed.

Serial Dates

The following page lists all known Serial records whose Title dates do not match the dates of the Publications that they appeared in as of the 2008-02-10 backup.