ISFDB:Data Consistency

From ISFDB
Revision as of 19:18, 19 February 2015 by Ahasuerus (talk | contribs) (→‎Authorless titles: Cleanup report exists; deleting)

The Data Consistency project is a place to coordinate efforts to identify and repair data inconsistencies, including Stray Publications, malformed ISBNs, etc.

Publication Records

Invalid characters in Publication titles




Type mismatches

  • Need a script that finds book-length Publications whose type doesn't match the type of the associated Title records, e.g. "collection" vs. "novel", "novel" vs. "omnibus", etc.
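A minimal sketch of such a script, using Python's sqlite3 module as a stand-in for the live database. The table and column names (pubs.pub_ctype, titles.title_ttype, pub_content) and the sample rows are assumptions made for illustration and should be checked against the actual schema before use.

```python
import sqlite3

# Hypothetical, simplified schema and sample rows; the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table pubs (pub_id integer primary key, pub_title text, pub_ctype text);
    create table titles (title_id integer primary key, title_title text, title_ttype text);
    create table pub_content (pub_id integer, title_id integer);
    insert into pubs values (1, 'Book A', 'NOVEL'), (2, 'Book B', 'COLLECTION');
    insert into titles values (10, 'Book A', 'NOVEL'), (20, 'Book B', 'NOVEL'),
                              (21, 'A Story', 'SHORTFICTION');
    insert into pub_content values (1, 10), (2, 20), (2, 21);
""")

# Publication types treated as "book length" in this sketch (an assumption).
BOOK_TYPES = ("NOVEL", "COLLECTION", "ANTHOLOGY", "OMNIBUS")

def find_type_mismatches(conn):
    """Return book-length pubs that contain no Title record of their own type."""
    marks = ",".join("?" * len(BOOK_TYPES))
    sql = f"""
        select p.pub_id, p.pub_title, p.pub_ctype
          from pubs p
         where p.pub_ctype in ({marks})
           and not exists (select 1
                             from pub_content pc
                             join titles t on t.title_id = pc.title_id
                            where pc.pub_id = p.pub_id
                              and t.title_ttype = p.pub_ctype)
         order by p.pub_id
    """
    return conn.execute(sql, BOOK_TYPES).fetchall()

mismatches = find_type_mismatches(conn)
```

In the sample data, 'Book B' is typed as a collection but contains only a novel Title and a short fiction Title, so it is the one row reported.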

Missing data

  • Need a script that finds missing data, e.g. missing page counts, missing pb/tp/hc data, 0000-00-00 dates, etc.
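One possible shape for such a script, again sketched against an in-memory SQLite table. The column names (pub_pages, pub_ptype, pub_year) and the sample rows are assumptions; the checks themselves mirror the examples listed above.

```python
import sqlite3

# Hypothetical, simplified pubs table; the real column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table pubs (pub_id integer primary key, pub_title text,
                       pub_year text, pub_pages text, pub_ptype text);
    insert into pubs values
        (1, 'Complete',   '1984-06-01', '256', 'pb'),
        (2, 'No pages',   '1990-01-01', NULL,  'hc'),
        (3, 'No binding', '2001-05-00', '300', ''),
        (4, 'No date',    '0000-00-00', '120', 'tp');
""")

def find_missing_data(conn):
    """Return sorted (pub_id, problem) pairs, one per missing field."""
    checks = {
        "missing page count": "pub_pages is null or pub_pages = ''",
        "missing binding": "pub_ptype is null or pub_ptype = ''",
        "unknown date": "pub_year is null or pub_year = '0000-00-00'",
    }
    found = []
    for problem, condition in checks.items():
        # The conditions are fixed strings defined above, not user input.
        rows = conn.execute(
            f"select pub_id from pubs where {condition} order by pub_id").fetchall()
        found.extend((pub_id, problem) for (pub_id,) in rows)
    return sorted(found)

problems = find_missing_data(conn)
```

Keeping each check as a separate named condition makes it easy to add further checks later without touching the reporting loop.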




Titles

Safe to auto-merge identical titles?

Related to the Gardner Dozois collection: for nearly all of its titles I found two title records that were identical except that one was the parent of the other. If it seems safe, it would save some work to do a sweep for identical title records and auto-merge them. Marc Kupper 22:25, 22 Dec 2006 (CST)
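As a starting point for discussion, the sweep could first only list candidate pairs rather than merge anything. The sketch below assumes a titles table with a title_parent column and a canonical_author link table; those names, the choice of fields to compare, and the sample rows are all assumptions.

```python
import sqlite3

# Hypothetical, simplified schema and sample rows; the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table titles (title_id integer primary key, title_title text,
                         title_ttype text, title_copyright text, title_parent integer);
    create table canonical_author (title_id integer, author_id integer);
    insert into titles values
        (1, 'Story One', 'SHORTFICTION', '1980-00-00', 0),
        (2, 'Story One', 'SHORTFICTION', '1980-00-00', 1),
        (3, 'Story Two', 'SHORTFICTION', '1981-00-00', 0),
        (4, 'Story 2',   'SHORTFICTION', '1981-00-00', 3);
    insert into canonical_author values (1, 100), (2, 100), (3, 100), (4, 100);
""")

def author_ids(conn, title_id):
    """Return the sorted author ids credited on one title record."""
    rows = conn.execute(
        "select author_id from canonical_author where title_id = ? order by author_id",
        (title_id,)).fetchall()
    return [author_id for (author_id,) in rows]

def find_merge_candidates(conn):
    """Return (parent_id, child_id) pairs whose title, type, date and authors all match."""
    pairs = conn.execute("""
        select p.title_id, c.title_id
          from titles c
          join titles p on p.title_id = c.title_parent
         where c.title_title = p.title_title
           and c.title_ttype = p.title_ttype
           and c.title_copyright = p.title_copyright
         order by p.title_id, c.title_id
    """).fetchall()
    return [(parent, child) for parent, child in pairs
            if author_ids(conn, parent) == author_ids(conn, child)]

candidates = find_merge_candidates(conn)
```

The function only reports pairs; whether any of them are safe to merge automatically is still the open question above.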


Tags

It would be nice to standardize user-generated tags.

Synonyms


  • hard science fiction
  • hard sf

  • history of sf
  • history-of-sf

  • juvenile sf
  • juvenile-sf

  • recursive sf
  • meta sf

  • 'young-adult humorous sf' -> 'humorous sf' + 'young-adult sf'

etc.

Someone should probably come up with a standard, then write filters for input and run those filters on the existing tags.
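A rough sketch of what such a filter might look like. Which form of each synonym pair becomes the standard has not been decided, so the choices in SYNONYMS and SPLITS below are placeholders taken from the examples above, and the function names are hypothetical.

```python
# Placeholder mapping: the standard form of each pair is an assumption.
SYNONYMS = {
    "hard sf": "hard science fiction",
    "history-of-sf": "history of sf",
    "juvenile-sf": "juvenile sf",
    "meta sf": "recursive sf",
}

# Compound tags that should be split into several standard tags.
SPLITS = {
    "young-adult humorous sf": ["humorous sf", "young-adult sf"],
}

def normalize_tag(tag):
    """Map one user-entered tag to a list of standard tags."""
    # Lowercase and collapse runs of whitespace before looking the tag up.
    cleaned = " ".join(tag.strip().lower().split())
    if cleaned in SPLITS:
        return list(SPLITS[cleaned])
    return [SYNONYMS.get(cleaned, cleaned)]

def normalize_tags(tags):
    """Normalize a sequence of tags, dropping duplicates but keeping order."""
    result = []
    for tag in tags:
        for standard in normalize_tag(tag):
            if standard not in result:
                result.append(standard)
    return result
```

The same function could serve both purposes mentioned above: applied to each new tag at input time, and run once over the existing tags.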


Authors

Authors who are (probably) doubled

I don't know how we can browse author summary pages by ID. Advanced search only gives access to author editing.

These have the same canonical names and their legal names are either equal or at least one legal name is NULL (so they have to be checked before merging):

-- Author records that share a canonical name and whose legal names
-- either match or are missing on at least one side.
select a1.author_canonical canonical, a1.author_legalname legal, count(1)
   from authors a1
   join authors a2
     on a1.author_canonical = a2.author_canonical and a1.author_id <> a2.author_id
   where a1.author_legalname = a2.author_legalname
      or a1.author_legalname is null
      or a2.author_legalname is null
   group by 1, 2;

Results from backup January 27, 2008:

+-------------+--------------+----------+
| canonical   | legal        | count(1) |
+-------------+--------------+----------+
| Philip Kent | NULL         |        2 | 
| Simon Clark | NULL         |        1 | 
| Simon Clark | Clark, Simon |        1 | 
+-------------+--------------+----------+
3 rows in set (1.70 sec)

Looks like one Simon Clark is new. How do these get created? --Roglo 13:15, 28 Jan 2008 (CST)

You can't do it by changing author names in Titles and Publications since the software will assume that you want to use the existing author record instead of creating a new one. However, when you change an Author's name in "Author Data", the software doesn't check whether there is another author record with the same name. Thus, if we have "Simon Clark" and "Simon Clarke" on file and the latter is changed to "Simon Clark", we will end up with two "Simon Clark" records. When reviewing submissions that attempt to modify Author records, moderators have no way of telling whether another Author record with the same name already exists, so such duplicates often slip through. Ahasuerus 15:00, 28 Jan 2008 (CST)
P.S. I have deleted the results from the last two passes, but they are available via this page's History if anybody needs them. Ahasuerus 15:00, 28 Jan 2008 (CST)
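The missing check described above could be sketched as follows. This is not how the ISFDB software works today; it is a hypothetical helper, run here against an in-memory SQLite copy of the authors table with made-up author_id values.

```python
import sqlite3

# Sample rows for illustration only; the author_id values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table authors (author_id integer primary key,
                          author_canonical text, author_legalname text);
    insert into authors values (1, 'Simon Clark', 'Clark, Simon'),
                               (2, 'Simon Clarke', NULL);
""")

def conflicting_authors(conn, author_id, new_name):
    """Return ids of other author records that already use new_name."""
    rows = conn.execute(
        "select author_id from authors"
        " where author_canonical = ? and author_id <> ?"
        " order by author_id",
        (new_name, author_id)).fetchall()
    return [row[0] for row in rows]

# Renaming record 2 from 'Simon Clarke' to 'Simon Clark' would collide with record 1.
collisions = conflicting_authors(conn, 2, "Simon Clark")
```

A non-empty result could be shown to the moderator as a warning on the submission review page.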

2008-02-10 results:

+-------------+--------------+----------+
| canonical   | legal        | count(1) |
+-------------+--------------+----------+
| Simon Clark | NULL         |        1 |
| Simon Clark | Clark, Simon |        1 |
+-------------+--------------+----------+
Fixed on 2008-02-29. Ahasuerus 23:57, 29 Feb 2008 (CST)

Pseudonym consistency

There are 22 authors in the latest dump who are self-pseudonyms (they have a pseudonym record pointing back to themselves):

-- Pseudonym records whose author and pseudonym are the same author record.
select ps.pseudo_id, ps.author_id, au.author_canonical
    from pseudonyms ps
    join authors au on ps.author_id = au.author_id
    where ps.author_id = ps.pseudonym;


There are two pairs of authors who are pseudonyms of each other:

  • Avram Davidson (author_id 501) and Ellery Queen (author_id 4921), twice
  • Douglas Stapleton (author_id 12921) and Doug Stapleton (author_id 12922)

There may also be longer cycles... WimLewis 22:28, 25 Mar 2007 (CDT)
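Longer cycles could be looked for by treating the pseudonyms table as a directed graph. The sketch below assumes the (author_id, pseudonym) rows have already been fetched into Python; the function name is hypothetical, and the approach (grouping ids by mutual reachability) is a simple illustration, not an optimized algorithm.

```python
from collections import defaultdict

def find_pseudonym_cycles(pairs):
    """Group author ids that can reach themselves by following pseudonym links.

    pairs is an iterable of (author_id, pseudonym) rows from the pseudonyms table.
    Returns a sorted list of sorted id lists, one per cycle group.
    """
    graph = defaultdict(set)
    for author_id, pseudonym in pairs:
        graph[pseudonym].add(author_id)  # edge from pseudonym to the author behind it

    def reachable(start):
        # Iterative depth-first search over everything reachable from start.
        seen, stack = set(), list(graph[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    reach = {node: reachable(node) for node in list(graph)}
    cycles = set()
    for node, seen in reach.items():
        if node in seen:  # node lies on at least one cycle
            members = frozenset(m for m in seen if node in reach.get(m, ()))
            cycles.add(members)
    return sorted(sorted(c) for c in cycles)
```

A self-pseudonym shows up as a group of one id, a mutual pair as a group of two, and any longer loop as a larger group.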



Title vs. Publication Type Consistency

This section documents mismatches between Title types and associated Publication types:

Pseudonyms in Collections

The following page lists all known Collection Publications that include pseudonymous Titles as of the August 11, 2007 backup. Although this is not always indicative of an error, we estimate that a significant percentage of these occurrences need to be fixed.

Serial Dates

The following page lists all known Serial records whose Title dates do not match the dates of the Publications that they appeared in as of the 2008-02-10 backup.
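A report like this could be produced along the following lines. The query actually used for the page is not recorded here, so this is only a sketch, with assumed table and column names (titles.title_copyright, pubs.pub_year, pub_content) and invented sample rows.

```python
import sqlite3

# Hypothetical, simplified schema and sample rows; the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table pubs (pub_id integer primary key, pub_title text, pub_year text);
    create table titles (title_id integer primary key, title_title text,
                         title_ttype text, title_copyright text);
    create table pub_content (pub_id integer, title_id integer);
    insert into pubs values (1, 'Magazine, May 1950', '1950-05-00'),
                            (2, 'Magazine, June 1950', '1950-06-00');
    insert into titles values (10, 'Serial (Part 1 of 2)', 'SERIAL', '1950-05-00'),
                              (11, 'Serial (Part 2 of 2)', 'SERIAL', '1950-05-00'),
                              (12, 'A Story', 'SHORTFICTION', '1949-00-00');
    insert into pub_content values (1, 10), (2, 11), (2, 12);
""")

# Serial titles whose date differs from the date of a publication containing them.
serial_date_mismatches = conn.execute("""
    select t.title_id, t.title_copyright, p.pub_id, p.pub_year
      from titles t
      join pub_content pc on pc.title_id = t.title_id
      join pubs p on p.pub_id = pc.pub_id
     where t.title_ttype = 'SERIAL'
       and t.title_copyright <> p.pub_year
     order by t.title_id, p.pub_id
""").fetchall()
```

In the sample data only the second serial installment is reported, because its Title date still carries the May date while it appeared in the June issue.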