Difference between revisions of "Talk:Database Schema"
m |
(Spam reversal) |
||
Line 1: | Line 1: | ||
− | + | == Examples of pub_content usage == | |
A simple NOVEL publication has a NOVEL content entry. When you display [http://www.isfdb.org/cgi-bin/pl.cgi?DTHBSTVXTX1978 the publication], there's no need to show the contents, as a Novel obviously contains a Novel. When you [http://www.isfdb.org/cgi-bin/edit/editpub.cgi?183625 edit] the pub though, the NOVEL content record is shown as you might want to put a page number on it, for instance. Here are the relevant bits of the relevant records, we have one entry in each table. | A simple NOVEL publication has a NOVEL content entry. When you display [http://www.isfdb.org/cgi-bin/pl.cgi?DTHBSTVXTX1978 the publication], there's no need to show the contents, as a Novel obviously contains a Novel. When you [http://www.isfdb.org/cgi-bin/edit/editpub.cgi?183625 edit] the pub though, the NOVEL content record is shown as you might want to put a page number on it, for instance. Here are the relevant bits of the relevant records, we have one entry in each table. | ||
Line 196: | Line 196: | ||
You may, for instance, have decided to have a look for all the Ace Doubles I pointed out earlier. And the Advanced Search does offer you the opportunity of searching for publications where Term 1 is 'Ace Double' and the search-type is 'Publisher'. This gives a nasty-looking set of Python error messages - don't worry, you haven't broken anything! Some of the error says: | You may, for instance, have decided to have a look for all the Ace Doubles I pointed out earlier. And the Advanced Search does offer you the opportunity of searching for publications where Term 1 is 'Ace Double' and the search-type is 'Publisher'. This gives a nasty-looking set of Python error messages - don't worry, you haven't broken anything! Some of the error says: | ||
− | select pubs.* from pubs where pubs.pub_publisher... ' | + | select pubs.* from pubs where pubs.pub_publisher... '%Ace Double%' order by pubs.pub_title limit 100" |
- this is a shortened version of the SQL it's trying - and the explanation is | - this is a shortened version of the SQL it's trying - and the explanation is | ||
"Unknown column 'pubs.pub_publisher' in 'where clause'" | "Unknown column 'pubs.pub_publisher' in 'where clause'" | ||
Line 213: | Line 213: | ||
select * from publishers | select * from publishers | ||
− | where publisher_name like ' | + | where publisher_name like '%Ace Double%' |
publisher_id publisher_name note_id | publisher_id publisher_name note_id | ||
Line 327: | Line 327: | ||
That looks familiar... oh yes, already fixed under the Ace Double initiative. Ignore. Here's an interesting pair though: | That looks familiar... oh yes, already fixed under the Ace Double initiative. Ignore. Here's an interesting pair though: | ||
− | Blood | + | Blood & Ivory: A Tapestry 4835 COLLECTION 2 |
+ | Blood and Ivory: A Tapestry 4836 COLLECTION 2 | ||
+ | |||
+ | Let's change the "&" version to NOVEL and see what's up. (The other doesn't have enough content to be interesting.) | ||
+ | |||
+ | Blood and Ivory: A Tapestry (expanded) • collection by P. C. Hodgell (aka Blood & Ivory: The Book of Jame) [as by P. C. Hodgell ] | ||
+ | Blood & Ivory: A Tapestry • collection by P. C. Hodgell | ||
+ | |||
+ | Well, somebody's tried some serious merges on titles there, along with a variant - I have NO clue what's correct. I AM going to leave the pub as a NOVEL, although remove the pair of collections, in the hope that someone can explain them all at some point. I'll put a NOVEL title into the pub as well just so that it shows up, and notes too. Am I breaking my rule of: | ||
+ | |||
+ | '''DO NOT TRY THIS IF YOU DO NOT KNOW HOW TO PUT THINGS BACK TO AT LEAST NO MORE THAN THE SAME SORT OF PROBLEM AS THEY HAD BEFORE!''' | ||
+ | |||
+ | I don't think so. I think it's LESS of a problem if the problem is still visible. | ||
+ | |||
+ | Here's an awkward one: it looks like two titles if you search that way, only one pub if you search that way: and it's VERIFIED. | ||
+ | |||
+ | Budayeen Nights 5665 COLLECTION 2 | ||
+ | |||
+ | OK, change to NOVEL: yes, there's two collections in there. Let's remove one - damn, it removed BOTH. The "two titles, one pub" should have given that away. OK, add one COLLECTION back and change the pub type back - sorted. No need to ask the absent verifier what I did wrong... but I really shouldn't be trying any more this late at night under the influence of alcohol. And that's only record 9 of 67... :-/ | ||
+ | |||
+ | I really wouldn't recommend this sort of fixing if you can't approve your own edits, it's difficult to explain WHY you're messing up things (in the short term) to a separate approver. But if you've got to that stage, then you might want to try this on some obscure titles that haven't been fully completed, and not verified. There's (thankfully!) not too many pubs in this state, but it's easy to create more (I have no idea if the problem is better or worse since the last backup, I can't think of a way to check the ''current'' ISFDB situation) so we have to be vigilant for such. I don't want to be the only fixer though. If you feel you can help, drop me a line on my talk page. [[User:BLongley|BLongley]] 17:48, 1 Nov 2007 (CDT) | ||
+ | |||
+ | : Note: as of tonight I've worked my way through all the current problems that this SQL example points out. I've left a few with the individual Verifiers to finish off. Thankfully, it seems a few had already been fixed, and NOT just by deleting the problem (which tends to delete good data along with the bad). But I'm sure there'll be some more visible after the next backup. In the meantime, if you spot apparent duplicate pubs, DO please use the "Diff Publications" option and check whether ISFDB is '''comparing a pub with itself''' - deleting either of the apparent duplicate titles will delete the ONE pub with all the useful information in, and you don't want to have to reenter it all, do you? [[User:BLongley|BLongley]] 17:10, 4 Nov 2007 (CST) | ||
+ | |||
+ | == Double Pseudonyms == | ||
+ | |||
+ | It's OK for an author to have many pseudonyms, but sometimes they get the same pseudonym twice: | ||
+ | |||
+ | select p.author_id, p.pseudonym, a2.author_canonical, a1.author_canonical, count(*) | ||
+ | from pseudonyms p, authors a1, authors a2 | ||
+ | where p.pseudonym = a1.author_id | ||
+ | and p.author_id = a2.author_id | ||
+ | group by p.author_id, p.pseudonym, a2.author_canonical, a1.author_canonical | ||
+ | having count(*) > 1 | ||
+ | order by 3 | ||
+ | |||
+ | author_id pseudonym author_canonical author_canonical1 count(*) | ||
+ | 131 21289 Brian W. Aldiss Brian Aldiss 2 | ||
+ | 4921 501 Ellery Queen Avram Davidson 2 | ||
+ | 1645 7807 Geo W. Proctor George W. Proctor 2 | ||
+ | 55321 68411 Jim Steranko Steranko 2 | ||
+ | 1170 11214 John Jakes John W. Jakes 2 | ||
+ | 60471 83981 Joseph Dreany J. Dreany 2 | ||
+ | 2902 21809 Michael A. Arnzen Mike Arnzen 2 | ||
+ | 517 40131 Miriam Allen deFord Miriam Allen de Ford 2 | ||
+ | 456 11213 Nelson S. Bond Nelson Bond 2 | ||
+ | 18315 81517 Richard Harris Barham R. H. Barham 2 | ||
+ | 29 70121 Robert A. Heinlein Robert Heinlein 2 | ||
+ | 10527 196 Robin Scott Wilson Robin Wilson 2 | ||
+ | 22 24197 Samuel R. Delany K. Leslie Steiner 2 | ||
+ | 10574 88989 Stephen P. Brown S. Patrick Brown 3 | ||
+ | 2188 32048 Vicente Segrelles Segrelles 2 | ||
+ | 1803 43381 Vincent Di Fate Vincent DiFate 2 | ||
+ | |||
+ | If anyone knows any method of cleaning these up that doesn't require SQL or massive numbers of edits, please let me know. [[User:BLongley|BLongley]] 14:26, 2 Nov 2007 (CDT) | ||
+ | |||
+ | : And I see there's three more now - please be careful, people! [[User:BLongley|BLongley]] 15:38, 8 Nov 2007 (CST) | ||
+ | |||
+ | == Entropy Measurements == | ||
+ | From [http://isfdb.org/wiki/index.php/ISFDB:Community_Portal#Database_Entropy an interesting discussion] here, let's see what we can do to measure our (hopefully continual) improvements. Taking the first suggestion: "For magazines, titles entered without page numbers", let's see what we can do. Here's a start: | ||
+ | |||
+ | select case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end, count(*) | ||
+ | from pub_content pc, pubs p, titles t | ||
+ | where p.pub_id = pc.pub_id | ||
+ | and p.pub_ctype = 'MAGAZINE' | ||
+ | and t.title_id = pc.title_id | ||
+ | and t.title_ttype NOT IN ('EDITOR','COVERART') | ||
+ | group by case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end | ||
+ | |||
+ | Bad 53750 | ||
+ | Good 48353 | ||
+ | |||
+ | Not even halfway yet, but this is not from the latest backup. Note that EDITOR and COVERART don't need page numbers, so shouldn't be included. And this doesn't actually include MAGAZINES with NO contents: | ||
+ | |||
+ | select count(*) | ||
+ | from pubs p where not exists (select 1 from pub_content pc where p.pub_id = pc.pub_id) | ||
+ | and p.pub_ctype = 'MAGAZINE' | ||
+ | |||
+ | But there's only 5 of those. As many problems as we have with NOVEL, NONFICTION, and COLLECTION(!) entries in magazines. [[User:BLongley|BLongley]] 14:58, 3 Dec 2007 (CST) | ||
+ | |||
+ | A couple of other quick Entropy Measurements, useful to watch over time maybe. | ||
+ | |||
+ | Pubs without pages, that aren't audio-books (which shouldn't have them) or electronic (which may or may not): | ||
+ | |||
+ | select case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end, count(*) | ||
+ | from pubs p | ||
+ | WHERE p.pub_ptype NOT LIKE '%audio%' | ||
+ | And p.pub_ptype NOT LIKE '%cassette%' | ||
+ | And p.pub_ptype NOT LIKE 'CD%' | ||
+ | And p.pub_ptype NOT LIKE 'compact disc%' | ||
+ | And p.pub_ptype NOT LIKE 'e%book%' | ||
+ | And p.pub_ptype NOT LIKE 'electron%' | ||
+ | And p.pub_ptype NOT LIKE '%web%' | ||
+ | And p.pub_ptype NOT LIKE '%ezine%' | ||
+ | And p.pub_ptype NOT LIKE '%internet%' | ||
+ | And p.pub_ptype NOT LIKE '%mp3%' | ||
+ | And p.pub_ptype NOT LIKE '%Adobe%' | ||
+ | And p.pub_ptype NOT LIKE '%Mobipocket%' | ||
+ | And p.pub_ptype NOT LIKE '%PDF%' | ||
+ | And p.pub_ptype NOT LIKE '%tape%' | ||
+ | And p.pub_ptype NOT LIKE '%www%' | ||
+ | And p.pub_ptype NOT LIKE '%digit%' | ||
+ | And p.pub_ptype NOT LIKE '%online%' | ||
+ | group by case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end | ||
+ | |||
+ | Bad 19535 | ||
+ | Good 82131 | ||
+ | |||
+ | Pubs without Prices: I've been using "L" on its own to force a link to Amazon UK at times, but if it's ONLY "L" then it still hasn't got a proper price. | ||
+ | |||
+ | select case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end, count(*) | ||
+ | from pubs p | ||
+ | group by case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end | ||
+ | |||
+ | Bad 13118 | ||
+ | Good 96196 | ||
+ | |||
+ | OK, off to download a later backup and see if things have got better or worse... [[User:BLongley|BLongley]] 13:22, 9 Dec 2007 (CST) | ||
+ | |||
+ | == Reviews of things that don't exist == | ||
+ | |||
+ | Due to "Entropy Measurements" again, I had a look at how we represent reviews. How do we get a review to refer to the reviewed title, reviewed author, and still fit a reviewer in? Well, this is one of the complicated bits. We sneak TWO entries into "canonical_author". Possibly the most misleading table name in the database: it links titles and authors, but different kinds of authors, so occasionally we need multiple entries. It doesn't just affect reviews either, it covers interviews. When looking at "canonical_author" there's a "ca_status" column that helps sort them out. I think it goes like this (for Review and Interview title types): | ||
+ | |||
+ | ca_status = 1 -- Reviewer/Interviewer | ||
+ | ca_status = 2 -- Interviewee | ||
+ | ca_status = 3 -- Author of reviewed title | ||
+ | |||
+ | So we can enter a pair of entries, for reviewer and reviewee, or interviewer and interviewee, and it all works somehow. (No idea what happens if all three are present.) | ||
+ | |||
+ | So I guess we can do something like this for reviews of missing things: | ||
+ | |||
+ | select t.title_title, a.author_canonical | ||
+ | from titles t, canonical_author ca, authors a | ||
+ | where ca.title_id = t.title_id | ||
+ | and a.author_id = ca.author_id | ||
+ | and t.title_ttype = 'REVIEW' | ||
+ | and ca.ca_status = 3 | ||
+ | and not exists (select 1 from pubs p | ||
+ | where p.pub_title = t.title_title) | ||
+ | LIMIT 20 | ||
+ | |||
+ | Notice the LIMIT, as these queries are dead slow if you get the "NOT EXISTS" bit wrong. I think the ISFDB software is cleverer for matching things that DO exist, and matches title and author against titles and their authors rather than against pubs. But when something is missing, it might be good to check against everywhere they MIGHT be. Still, as this can only be run against downloads of the database, maybe it's time to look at how we can locally add indexes and suchlike performance improvements for the queries that check BAD things rather than just GOOD things. | ||
+ | |||
+ | (Database experts, feel free to butt in here. I'd rather get a definitive answer on the design over figuring it out myself from the data.) [[User:BLongley|BLongley]] 16:13, 7 Dec 2007 (CST) | ||
+ | |||
+ | == Playing with TOAD 3.1 == | ||
+ | Well, it still kills my machine if I run an inefficient query, but the Graphical Query Builder is looking quite useful (rather like the one in MS-Access). As usual though, the sort of SQL it generates isn't what I'd necessarily write: | ||
+ | |||
+ | SELECT `titles`.title_ttype, `authors`.author_canonical, `titles`.title_title, | ||
+ | MAX(`verification`.user_id) | ||
+ | , MAX(`verification`.reference_id) | ||
+ | FROM ( ( ( ( isfdb.titles `titles` | ||
+ | INNER JOIN | ||
+ | isfdb.canonical_author `canonical_author` | ||
+ | ON (`titles`.title_id = `canonical_author`.title_id)) | ||
+ | INNER JOIN | ||
+ | isfdb.authors `authors` | ||
+ | ON (`canonical_author`.author_id = `authors`.author_id)) | ||
+ | INNER JOIN | ||
+ | isfdb.pub_content `pub_content` | ||
+ | ON (`pub_content`.title_id = `titles`.title_id)) | ||
+ | INNER JOIN | ||
+ | isfdb.pubs `pubs` | ||
+ | ON (`pub_content`.pub_id = `pubs`.pub_id)) | ||
+ | LEFT OUTER JOIN | ||
+ | isfdb.verification `verification` | ||
+ | ON (`verification`.pub_id = `pubs`.pub_id) | ||
+ | WHERE (`authors`.author_canonical = 'Philip K. Dick') | ||
+ | AND (`titles`.title_ttype IN ('NOVEL', 'ANTHOLOGY', 'COLLECTION')) | ||
+ | AND (`verification`.reference_id = 1 | ||
+ | OR `verification`.reference_id IS NULL) | ||
+ | AND (`verification`.user_id IN (2781, 4121) | ||
+ | OR `verification`.user_id IS NULL) | ||
+ | GROUP BY `titles`.title_ttype, | ||
+ | `authors`.author_canonical, | ||
+ | `titles`.title_title | ||
+ | |||
+ | Still, that's a perfectly useful query for finding books by an author and showing whether I've verified them or not. Now if only I'd verified all the pubs I own before other people got to them... :-/ [[User:BLongley|BLongley]] 12:55, 9 Dec 2007 (CST) | ||
+ | |||
+ | == Authors that only exist because of reviews == | ||
+ | They are a pain as they show up in the Author Directory with no useful information. This script should find the stray authors, along with the title IDs for the reviews that create them. (Bung the title ID into a "http://www.isfdb.org/cgi-bin/title.cgi?XXXXXX" query directly.) | ||
+ | |||
+ | select a.author_canonical, ca.title_id | ||
+ | FROM canonical_author ca, authors a | ||
+ | WHERE ca.ca_status = 3 | ||
+ | and ca.author_id = a.author_id | ||
+ | AND NOT EXISTS (SELECT 1 from canonical_author ca2, titles t | ||
+ | where ca.author_id = ca2.author_id | ||
+ | AND ca2.title_id = t.title_id | ||
+ | AND t.title_ttype != 'REVIEW' | ||
+ | and ca2.ca_status = 1) | ||
+ | order by 1,2 | ||
+ | |||
+ | According to my last loaded backup, there's about 1200 authors that may need the review correcting, or a reviewed title to be created. I'll update when I've loaded the latest. [[User:BLongley|BLongley]] 09:06, 29 Dec 2007 (CST) | ||
+ | |||
+ | : Down to 994 now - must have been a good month of cleanups! [[User:BLongley|BLongley]] 09:23, 29 Dec 2007 (CST) |
Revision as of 00:12, 18 March 2008
Examples of pub_content usage
A simple NOVEL publication has a NOVEL content entry. When you display the publication, there's no need to show the contents, as a Novel obviously contains a Novel. When you edit the pub though, the NOVEL content record is shown as you might want to put a page number on it, for instance. Here are the relevant bits of the relevant records, we have one entry in each table.
pubs pub_content titles pub_id: 183625 pubc_id: 739717 title_id: 459 pub_ctype: NOVEL pub_id: 183625 title_ttype: NOVEL pub_tag: DTHBSTVXTX1978 title ID: 459 title_title: Deathbeast pub_title: Deathbeast
Of course, a NOVEL Title doesn't have to be in a NOVEL publication. What if it's part of an Omnibus? You obviously want a link from the NOVEL title to the OMNIBUS publication. And a link for each other NOVEL to the same OMNIBUS publication. But an Omnibus isn't just the sum of its contents, so there's an OMNIBUS title record as well in the same pub. Again, this doesn't need to be shown when you display the publication, but we DO want to display the NOVEL contents. This time though, the OMNIBUS record is hidden when you edit the pub - there's nothing you really need to do to it. (Well, you could set a length for it, which is the way we record how many entries, or which entries from a series, make up the omnibus. But we do that on the Title itself instead.)
pubs pub_content titles pub_id: 174441 pubc_id: 673891 title_id: 503331 pub_ctype: OMNIBUS pub_id: 174441 title_ttype: OMNIBUS pub_tag: DDFNNLKHNL2000 title_id: 503331 title_title: Dead Funny pub_title: Dead Funny pubc_id: 673901 title_id: 1926 pub_id: 174441 title_ttype: NOVEL title_id: 1926 title_title: Flying Dutch pubc_id: 673911 title_id: 8769 pub_id: 174441 title_ttype: NOVEL title_id: 8769 title_title: Faust Among Equals
There's actually another content record for this pub, a COVERART record. This stays hidden when you edit the pub, because it's maintained automatically when the Artist field is used.
pubc_id: 673881 title_id: 503321 pub_id: 174441 title_ttype: COVERART title_id: 503321 title_title: Cover: Dead Funny
COLLECTION and ANTHOLOGY pubs (and even NOVELS) also have content title types such as SHORTFICTION and ESSAY and INTERIORART and POEM that show up on the pub display and while editing, they're pretty harmless so long as you understand that you're not working on the entry in this pub alone, you're working on ALL entries of that title in ANY publication. INTERVIEW and REVIEW types have their own sections while editing, but you may need to change one of the other content types to one of these if it was mis-entered at first. Those require an extra field to be entered though (the Author moves to Interviewer/Reviewer but you still have to add Book Author/interviewee), so you'll want to do two passes of edits.
It's the Special record used to link the title and pub that is dangerous to change: if one of these becomes visible while editing, try and keep the pub type and entry type matched: NOVEL to NOVEL, COLLECTION to COLLECTION, ANTHOLOGY to ANTHOLOGY, OMNIBUS to OMNIBUS. Sometimes when correcting a publication type, one of these special records isn't hidden away any more and looks as though it could be usefully changed to a normal content record: e.g. if a NOVEL called "XYZ" is actually a collection with a Short Story called "XYZ" in it, it may be tempting to change the XYZ NOVEL record to the SHORTFICTION entry that's required. DON'T! The NOVEL record should change to COLLECTION to keep the link working, and a SHORTFICTION for "XYZ" the short story is added separately.
The problems that start when you lose the link record that matches a pub and title may not be obvious immediately, but if you find pubs where the hyper-link to the title record isn't present, or get warnings about title record mismatches, or you get warnings when Cloning or trying to Remove Titles from a pub, stop and check. Ask for help if needed.
NONGENRE and NONFICTION look as though they might be used for normal contents, but they're not intended that way: they're intended to be the ONE main record for a publication that's of little interest to us as it's not SF - this is why we have troubles if we try and enter Non-Genre Short-fiction in a Non-Genre Publication for instance. Interesting Non-Fiction pieces in a publication should be of type ESSAY - one big NONFICTION record would be used to class the whole pub as NONFICTION, and not worth going into a lot of detail about in ISFDB.
SERIAL shouldn't give you a problem if you're working on books, it's intended for magazines (although it has been used within books to group episodes of a story that aren't consecutive in the book). But if you're working on magazines, then you will come across EDITOR and COVERART too - but I'll leave explaining those to a Magazine mod.
I hope this helps! BLongley 09:37, 21 Oct 2007 (CDT)
Things to do with pub_content in MySQL
As you can see from above, there may be a number of content records for even the simplest pub, so let's start with a simple case: NO content records for a title. Here's some SQL to find them:
select t.title_id, t.title_title from titles t where not exists (select 1 from pub_content pc where pc.title_id = t.title_id) and t.title_parent = 0 LIMIT 100
You'll want that "LIMIT" on the end as there are quite a few of them.
So what's wrong with these?
- There can be titles we are sure exist, but we don't have details of any publication yet: you could go search one out at an internet bookshop and create a publication, if you like. First editions are best as they tend to confirm the date on the title.
- Maybe the title record shouldn't be there: if someone deleted some pub that turned out to be an RPG dice set, the title wouldn't go automatically, it has to be deleted in a second submission. Maybe someone forgot that second submission?
Simple, eh? Well, I kept it simple deliberately, by restricting it to non-variant titles. Let's take out the "t.title_parent = 0"
select t.title_id, t.title_title, t.title_parent from titles t where not exists (select 1 from pub_content pc where pc.title_id = t.title_id) LIMIT 100
Look at the title_parent: if it's not zero then this is a Variant title. (This is why the way to remove an incorrect variant link is to set the parent to zero, rather than blank.) In which case:
- The "missing" publication(s) might be stored under the parent instead, and need to be moved back, or
- We're just missing the publication(s) for THIS variation of the title (feel free to go find one to enter), or
- There never WILL be an entry under this variation.
Why is the last OK? Well, consider Iain M. Banks and Iain Banks. They're the same person, using the world's least-confusing pseudonym. Any given book of his is published under one name or the other, depending on whether it's SF or not. However, we keep ONE canonical author record for him, and so we need all his books to have an entry for "Iain M. Banks", even when that publication will never be published under that name, only as by "Iain Banks". So one variant will have pubs, the other won't: it's just to put all the titles under one name, and the software will figure out the "as by" for us. This is why the simple case above is actually less simple than two cases: again, it might be fine to have an empty title if all the publications go under a variant title of the one you're looking at.
If this is confusing you, just try it out. You can search for the t.title_title in ISFDB itself, with the usual title search: if you want to search by the t.title_parent, then you can put that number into a URL and search that way, e.g.
http://www.isfdb.org/cgi-bin/title.cgi?887
or use some more SQL:
select * from titles t where t.title_id = 887
There's one other over-simplification I've made here you may have already noticed. What happens when there's a variant title of a variant title? Well, that's when people start getting headaches, so that's worth saving for a different topic.
Let's try it from the other side
How about, instead of titles with no contents, pubs with no contents?
select p.pub_id, p.pub_tag, p.pub_title, p.pub_ctype from pubs p where not exists (select 1 from pub_content pc where pc.pub_id = p.pub_id) LIMIT 100
(Again, you may want a LIMIT on this, there's a few hundred.)
If you plug the resulting pub_title information into a publication search (last of the Advanced search options), you may well think everything looks OK, it finds the publication alright. You can even click on the Title and find a fairly normal-looking display. (Can you spot the missing bit though?) Try it in the Advanced or Normal title searches and you won't necessarily find ANY records though. Search for that author and you won't necessarily find that title. If you do, you won't find that pub under it.
What's happening here then? Well, without a pub_content record linking the pub to the title, you don't get the title link appearing when you look at the pub. The Author's titles still show up, but this publication won't appear under them. How to fix? Simple! Once you've found the publication (remember, only the advanced pub search will work) you'll either find the orphan record, or the orphan record and a good record to go with it. For instance, at the moment I can see
pub_id pub_tag pub_title pub_ctype 5643 BBBLSNDTHSA2004 Bubbles and the Secret Admirer NOVEL
and searching for "Bubbles and the Secret Admirer" shows:
5642 Bubbles and the Secret Admirer E. S. Mooney 2004 48 BBBLSNDTHS2004 tp NOVEL 0-439-49177-0 $3.99 Scholastic 5643 Bubbles and the Secret Admirer E. S. Mooney 2004 48 BBBLSNDTHSA2004 tp NOVEL 0-439-49177-0 $3.99 Scholastic
Record 5643 can be deleted as record 5642 is the good version.
However, I also see:
pub_id pub_tag pub_title pub_ctype 5641 BRTRBTSTP2004 Brute Orbits Tp NOVEL
And searching for "Brute Orbits" (as I guessed the "Tp" was an error) finds:
Record Title Authors Year Pages Tag Binding Type ISBN Price Publisher 5639 Brute Orbits George Zebrowski 1998 222 BRUTEORBIT1998 hc NOVEL 0-06-105026-1 $23.00 HarperPrism 5640 Brute Orbits George Zebrowski 1999 ? BRTRBTSTP1999 tp NOVEL 0-06-105380-5 $15.00 Eos 225585 Brute Orbits George Zebrowski 1999 336 BRTRBTSTJD1999 pb NOVEL 0-06-105807-6 $6.99 HarperPrism 5641 Brute Orbits Tp George Zebrowski 2004 ? BRTRBTSTP2004 tp NOVEL 0-06-105380-5 $15.00 Eos
We don't have a 2004 trade paperback edition, so we probably want to rescue this 5641 stray. You can get to edit this publication directly via the "record" link, or via the title link and "Edit this pub". Remember all the "Special" contents records I told you not to mess with? Well, this is a time where we DO want to mess with it, as it's broken. As there's no content at all, the software invites you to enter an ANTHOLOGY - we don't necessarily want that though, we want what the publication says. (Most seem to be NOVELs though.) Fortunately there's enough info there - enter a Content record with the right title, author, type and date and submit. Pick the date from the earliest GOOD record, or from the pub, it won't matter - we're going to have to merge titles after approval anyway. You should probably make sure the title matches up, so that it's easy to search for the titles to BE merged.
You may find a complete stray, with no other good record equivalent: you can fix it this way so other editors will come across it naturally, or research the pub and delete it if it's completely wrong.
How does this happen? I don't know. If you find new examples being created, let us know. I personally suspect that the ISFDB-1 to ISFDB-2 conversion created such entries (as there's often a good pub entry with a pub_id 1 less than the pub_id of the bad pub entry) , in which case this article will become obsolete when we've fixed them all. BLongley 15:08, 21 Oct 2007 (CDT)
From too few to too many
OK, we've looked at pubs with not enough (i.e. zero) contents, and titles with not enough (i.e. zero) contents, what about when there's too MANY contents? I'm not talking about books with a lot of SHORTFICTION or ESSAYs in them, those are common: they're probably an ANTHOLOGY or COLLECTION. And a book about or by an artist may have a lot of INTERIORART. It's the Special, linking, content records that might be a worry. I mentioned that a simple case is a NOVEL containing a NOVEL: and we all know that a book containing two or more NOVELs is called an OMNIBUS, isn't it? So let's go find some Omnibuses that haven't been called that.
select p.pub_id, p.pub_tag, p.pub_title, p.pub_ctype , COUNT(*) from pubs p, titles t, pub_content pc where pc.pub_id = p.pub_id and pc.title_id = t.title_id and t.title_ttype = 'NOVEL' and p.pub_ctype != 'OMNIBUS' GROUP BY p.pub_id,p.pub_tag, p.pub_title, p.pub_ctype HAVING COUNT(*) > 1 LIMIT 100
Oh dear - it seems that sometimes we don't call a book with two NOVELs an OMNIBUS, we might call it an "Ace Double" instead:
pub_id pub_tag pub_title pub_ctype COUNT(*) 1591 CRSSTNVFE1958 Across Time / Invaders From Earth NOVEL 2 1769 GNSTRCTTTH1972 Against Arcturus / Time Thieves NOVEL 2 1998 LNFMRTMC1956 Alien From Arcturus / Atom Curtain NOVEL 2
One for a Standards discussion maybe. I don't own many, and none of those, so I'm opting out of that for now. I can't ignore these though: record 4545 comes up:
pub_id pub_tag pub_title pub_ctype COUNT(*) 4545 BIPOHL1982 BiPohl NOVEL 2
It's a title I might have considered buying, and Advanced searching for "BiPohl" publications shows these:
Record Title Authors Year Pages Tag Binding Type ISBN Price Publisher 4545 BiPohl Frederik Pohl 1982 313 BIPOHL1982 pb NOVEL 0-345-30247-8 $2.75 Ballantine Del Rey 4546 BiPohl Frederik Pohl 1987 ? BPHL1987 tp OMNIBUS 0-345-35005-7 $3.50 Ballantine 136061 BiPohl Frederik Pohl 1982 314 BPHLLHKQVN1982 pb OMNIBUS 0-345-30247-8 $2.75 Ballantine
It turns out 136061 has been verified, and contains two novels I already own - just the sort of information I wanted! Saves me buying a book I don't need. Looking at 4545 makes me think it's a Duplicate, and can be deleted.
Here's another book I own a version of:
pub_id pub_tag pub_title pub_ctype COUNT(*) 13811 FRSTLNSMN1982 First Lensman NOVEL 2
Oh dear, when I try and edit that I see it's got TWO identical NOVEL entries. Never mind, I can "Remove Titles From This Pub" can't I? No. ISFDB hides the "correct entry" for me: unfortunately both are correct, we just don't need TWO of them.
Blank out one of them? Doesn't work.
How about if I change the title type to something different like ANTHOLOGY? Well, that shows up the two NOVELS for removal in "Remove Titles From This Pub", let's try removing one. There's a big yellow "WARNING: Unable to locate the title reference for this publication" message though... oh well, it's a screwed-up pub anyway, let's try it - Wey-hey, it worked! OK, now I need to change the title back to NOVEL.
I'll do a few more whilst I'm at it.
pub_id pub_tag pub_title pub_ctype COUNT(*) 22863 MRTHNHMN601953 More Than Human NOVEL 2 22864 MRTHNHMN1953 More Than Human NOVEL 2
Oh no! I suggested we remove ONE title, it says I want to remove BOTH! REJECT, REJECT, REJECT! Or maybe not. So long as I put the title BACK afterwards it's OK. Phew, that worked. Oh wait, it's considered a new title - merge them both and we're back to normal. This is definitely NOT for the faint-hearted! Far better to leave it alone and call for help, I think. And it's arguable about whether it's an anthology anyway, it seems. Maybe it's a novel with some extra short-fiction and essays? TWO copies of the same novel in one pub was obviously wrong, but unless I've got all the data about that novel saved in case I need to put it back, I'm not going to touch such again for a while. So much more to learn...
Back to a simple table
The last example caused too much thinking, so let's go look at another simple table that may be of use: Verification. Hopefully if you've got to this stage, you've been verifying the publications you actually own. Primary Verification where you've still got the book is reference_id 1. If you want to mourn books you didn't keep, the "Primary (Transient)" is reference_id 12. Note that this does NOT match the reference_id in the reference table!
Here's how I find what I verified:
select p.pub_id, p.pub_title, v.ver_time from pubs p, verification v where v.pub_id = p.pub_id and v.reference_id = 1 and v.user_id = 2781 ORDER by 2
That's alphabetical by title, you may want to ORDER BY 3 instead if you're trying to find all the early verifications you did that you think aren't quite good enough by your current standards. ;-)
Of course, that's just me bragging about MY verifications as it stands - you need to put your own user_id in, I'm 2781 and you're NOT. YOU are... well, we can't find that out from SQL, the backups hide all the personally identifiable information. (Phew!) You'll need to go look at your own Wiki preferences and read the first line, "Your internal ID number is XXXX". I bet you haven't looked at THAT page for a while, maybe there's something else you'd like to update while you're there?
"Bill, you haven't put the author in there, how do you tell your Starburst by Pohl from your Starburst by Bester?" Well, of course, I've memorised the pub_id for everything I still own... ;-) Nah, just kidding. The reason is that a lot of my books aren't by a single author, or are by a variant author, and tonight I just want a simple example. It'll get complex enough soon enough, you can be sure of that. Just try the simple things for now. BLongley 16:03, 23 Oct 2007 (CDT)
Why the Publisher Search doesn't work
You may, for instance, have decided to have a look for all the Ace Doubles I pointed out earlier. And the Advanced Search does offer you the opportunity of searching for publications where Term 1 is 'Ace Double' and the search-type is 'Publisher'. This gives a nasty-looking set of Python error messages - don't worry, you haven't broken anything! Some of the error says:
select pubs.* from pubs where pubs.pub_publisher... '%Ace Double%' order by pubs.pub_title limit 100"
- this is a shortened version of the SQL it's trying - and the explanation is
"Unknown column 'pubs.pub_publisher' in 'where clause'"
Pretty self-explanatory really, as there is indeed NO pubs.pub_publisher column in the pubs table. So what SQL should be being executed? Well, the publishers seem to have been normalised into their own table, publishers, which is quite a simple one:
desc publishers;
Field Type Null Key publisher_id int(11) NO PRI publisher_name varchar(64) YES note_id int(11) YES
We can search for 'Ace Double' in here:
select * from publishers where publisher_name like '%Ace Double%'
publisher_id publisher_name note_id 381 Ace Double
and if you look at the pubs table, sure enough there's a publisher_id column there we can search by:
select * from pubs p where p.publisher_id = 381 LIMIT 100
Bingo! Lots and lots of Ace Doubles. Of course, you can find most of them just by searching for titles of ' / ' in ISFDB itself, but SQL can be a bit more generic, or more specific - ' / ' finds some NON-Ace Doubles for instance. Combine the two queries and use something like this to search for publications by the Publisher of your choice: replace "Bill" with the publisher you want.
select pu.publisher_name, p.* from pubs p, publishers pu where p.publisher_id = pu.publisher_id and pu.publisher_name like '%Bill%' order by p.pub_title LIMIT 100
This seems to be a good example of what the database USED to be like - where there once was a pubs.pub_publisher column? - and what it may well become: that publishers.note_id column hasn't been used yet.
"And Not an Omnibus...."
There's another Python problem if you try that in Advanced Search. Say you've tackled all the Ace Doubles, and now want to look at everything else that has ' / ' in the title, but ISN'T an Omnibus. Well, it's easy to search for ' / ' and Ttype of 'NOVEL' for instance, although the search won't recognise the spaces around the slash. But trying to search for ' / ' and NOT Ttype of 'OMNIBUS' doesn't work:
ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'ANDNOT titles.title_ttype='OMNIBUS' order by titles.title_title limit 100' at line 1") args = (1064, "You have an error in your SQL syntax; check the ... order by titles.title_title limit 100' at line 1")
That seems to suggest that ISFDB is using "ANDNOT" rather than "AND NOT". Dead easy to correct if we run the SQL ourselves though:
select t.* from titles t where t.title_title like '% / %' AND NOT t.title_ttype = 'OMNIBUS'
OK, that shows there are plenty of other types to exclude. You may want to try something like
select t.* from titles t where t.title_title like '% / %' and not t.title_ttype = 'OMNIBUS' and not t.title_ttype = 'COVERART' and not t.title_ttype = 'INTERIORART' and not t.title_ttype = 'ESSAY' and not t.title_ttype = 'POEM' and not t.title_ttype = 'EDITOR'
Or if we remember "!=" is slightly less typing:
select t.* from titles t where t.title_title like '% / %' and t.title_ttype != 'OMNIBUS' and t.title_ttype != 'COVERART' and t.title_ttype != 'INTERIORART' and t.title_ttype != 'ESSAY' and t.title_ttype != 'POEM' and t.title_ttype != 'EDITOR'
Or even better, remember "NOT IN" and use:
select t.* from titles t where t.title_title like '% / %' and t.title_ttype NOT IN ('OMNIBUS', 'COVERART', 'INTERIORART', 'ESSAY', 'POEM', 'EDITOR')
You could also use "IN" rather than "NOT IN" and list the other types instead - title_ttype is fixed to certain values, namely 'ANTHOLOGY', 'BACKCOVERART', 'COLLECTION', 'COVERART', 'INTERIORART', 'EDITOR', 'ESSAY', 'INTERVIEW', 'NOVEL', 'NONFICTION', 'NONGENRE', 'OMNIBUS', 'POEM', 'REVIEW', 'SERIAL', 'SHORTFICTION', 'CHAPTERBOOK'.
Lots of options, but all quite simple, and all working around a simple problem.
Are you LOOKING for trouble?
Hopefully by now you can spot what this SQL is trying to do.
select p.pub_title, pc.pub_id, t.title_ttype, count(*) from pub_content pc, titles t, pubs p where pc.title_id = t.title_id and pc.pub_id = p.pub_id and t.title_ttype IN ('ANTHOLOGY', 'COLLECTION') and p.pub_ctype != 'OMNIBUS' group by pc.pub_id, t.title_ttype having count(*) > 1
It's OK to have an Anthology or a Collection as part of a bigger book, if the book is an Omnibus. You'd have contents for the Antholog(y/ies) and Collection(s) AND contents for all the shortfiction and essays and such too. That's fine. But we don't want collections of collections or anthologies of anthologies. And some of these problem publications have the SAME title in them twice (who ignored the big Yellow Warning messages about merging, eh?).
But these are not simply classifiable from the above information alone, you're going to have to investigate, and make some pretty dangerous edits in the meantime too.
DO NOT TRY THIS IF YOU DO NOT KNOW HOW TO PUT THINGS BACK TO AT LEAST NO MORE THAN THE SAME SORT OF PROBLEM AS THEY HAD BEFORE!
OK, now I've warned you, let's have a look at a few:
A Star Above and Other Stories 1190 COLLECTION 2
Two collections within one collection? Not that ISFDB will show you that. Changing the Pub type to NOVEL showed up the two collections though. One (with some Internet research despite the NESFA site being unavailable) seems clearly bogus, so "Remove Titles from this pub" got rid of that one. The other is right(ish) - it's "A Star Above It and Other Stories". So that got retitled, and although I don't want to put all the contents in (I don't own it, can't "Look Inside" on Amazon or suchlike, and it postdates the catalogue at tamu.edu) I've at least left the suspected contents in notes for someone to sort out later. It IS a collection though, so the pub type got changed back.
Anasazi 2564 COLLECTION 2
Changing the Pub type to NOVEL showed up the two collections again. Unfortunately they were both the same one, so when I "Removed Titles from this pub" BOTH went, and I had to put one back. Fortunately Contento had details of the contents of this edition, so they now have page numbers. Unfortunately there is a second printing too, so I needed to merge the Collection titles again afterwards. (Yes, changing to NOVEL is usually a temporary measure only, this needed to go back to "COLLECTION" status.)
Bending the Landscape: Science Fiction 3999 ANTHOLOGY 2
This was a simpler one. Changing the Pub type to NOVEL showed up the two Anthologies, and one was called an Introduction - clearly an 'ESSAY' mis-classified. So I fixed that entry, changed the pub type back to 'ANTHOLOGY' and it seems OK now.
Note that the usual resolution is to change the publication back TO the type you changed it FROM. If you've ignored the Big Warnings and got out of your depth, this is the usual thing to try even if you haven't fixed anything.
DO please ask for help if you think you've made things WORSE.
Best from Fantasy and Science Fiction: 14th Series 4076 ANTHOLOGY 2 Best from Fantasy and Science Fiction: 22nd Series 4083 ANTHOLOGY 2
Well, these ones are easy - ISBN-10: 9997374851 - ISBN-10: 9997376463 - just zap them. There are GOOD records for those titles around still, we can do without this rubbish. (Yes, I did check if there was more information about them under those ISBNs, but only Amazon admits they exist and Amazon is WRONG.)
Beyond the Galactic Rim / The Ship From Outside 4374 COLLECTION 3
That looks familiar... oh yes, already fixed under the Ace Double initiative. Ignore. Here's an interesting pair though:
Blood & Ivory: A Tapestry 4835 COLLECTION 2 Blood and Ivory: A Tapestry 4836 COLLECTION 2
Let's change the "&" version to NOVEL and see what's up. (The other doesn't have enough content to be interesting.)
Blood and Ivory: A Tapestry (expanded) • collection by P. C. Hodgell (aka Blood & Ivory: The Book of Jame) [as by P. C. Hodgell ] Blood & Ivory: A Tapestry • collection by P. C. Hodgell
Well, somebody's tried some serious merges on titles there, along with a variant - I have NO clue what's correct. I AM going to leave the pub as a NOVEL, although remove the pair of collections, in the hope that someone can explain them all at some point. I'll put a NOVEL title into the pub as well just so that it shows up, and notes too. Am I breaking my rule of:
DO NOT TRY THIS IF YOU DO NOT KNOW HOW TO PUT THINGS BACK TO AT LEAST NO MORE THAN THE SAME SORT OF PROBLEM AS THEY HAD BEFORE!
I don't think so. I think it's LESS of a problem if the problem is still visible.
Here's an awkward one: it looks like two titles if you search that way, only one pub if you search that way: and it's VERIFIED.
Budayeen Nights 5665 COLLECTION 2
OK, change to NOVEL: yes, there's two collections in there. Let's remove one - damn, it removed BOTH. The "two titles, one pub" should have given that away. OK, add one COLLECTION back and change the pub type back - sorted. No need to ask the absent verifier what I did wrong... but I really shouldn't be trying any more this late at night under the influence of alcohol. And that's only record 9 of 67... :-/
I really wouldn't recommend this sort of fixing if you can't approve your own edits, it's difficult to explain WHY you're messing up things (in the short term) to a separate approver. But if you've got to that stage, then you might want to try this on some obscure titles that haven't been fully completed, and not verified. There's (thankfully!) not too many pubs in this state, but it's easy to create more (I have no idea if the problem is better or worse since the last backup, I can't think of a way to check the current ISFDB situation) so we have to be vigilant for such. I don't want to be the only fixer though. If you feel you can help, drop me a line on my talk page. BLongley 17:48, 1 Nov 2007 (CDT)
- Note: as of tonight I've worked my way through all the current problems that this SQL example points out. I've left a few with the individual Verifiers to finish off. Thankfully, it seems a few had already been fixed, and NOT just by deleting the problem (which tends to delete good data along with the bad). But I'm sure there'll be some more visible after the next backup. In the meantime, if you spot apparent duplicate pubs, DO please use the "Diff Publications" option and check whether ISFDB is comparing a pub with itself - deleting either of the apparent duplicate titles will delete the ONE pub with all the useful information in, and you don't want to have to reenter it all, do you? BLongley 17:10, 4 Nov 2007 (CST)
Double Pseudonyms
It's OK for an author to have many pseudonyms, but sometimes they get the same pseudonym twice:
select p.author_id, p.pseudonym, a2.author_canonical, a1.author_canonical, count(*) from pseudonyms p, authors a1, authors a2 where p.pseudonym = a1.author_id and p.author_id = a2.author_id group by p.author_id, p.pseudonym, a2.author_canonical, a1.author_canonical having count(*) > 1 order by 3
author_id pseudonym author_canonical author_canonical1 count(*) 131 21289 Brian W. Aldiss Brian Aldiss 2 4921 501 Ellery Queen Avram Davidson 2 1645 7807 Geo W. Proctor George W. Proctor 2 55321 68411 Jim Steranko Steranko 2 1170 11214 John Jakes John W. Jakes 2 60471 83981 Joseph Dreany J. Dreany 2 2902 21809 Michael A. Arnzen Mike Arnzen 2 517 40131 Miriam Allen deFord Miriam Allen de Ford 2 456 11213 Nelson S. Bond Nelson Bond 2 18315 81517 Richard Harris Barham R. H. Barham 2 29 70121 Robert A. Heinlein Robert Heinlein 2 10527 196 Robin Scott Wilson Robin Wilson 2 22 24197 Samuel R. Delany K. Leslie Steiner 2 10574 88989 Stephen P. Brown S. Patrick Brown 3 2188 32048 Vicente Segrelles Segrelles 2 1803 43381 Vincent Di Fate Vincent DiFate 2
If anyone knows any method of cleaning these up that doesn't require SQL or massive numbers of edits, please let me know. BLongley 14:26, 2 Nov 2007 (CDT)
- And I see there's three more now - please be careful, people! BLongley 15:38, 8 Nov 2007 (CST)
Entropy Measurements
From an interesting discussion here, let's see what we can do to measure our (hopefully continual) improvements. Taking the first suggestion: "For magazines, titles entered without page numbers", let's see what we can do. Here's a start:
select case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end, count(*) from pub_content pc, pubs p, titles t where p.pub_id = pc.pub_id and p.pub_ctype = 'MAGAZINE' and t.title_id = pc.title_id and t.title_ttype NOT IN ('EDITOR','COVERART') group by case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end
Bad 53750 Good 48353
Not even halfway yet, but this is not from the latest backup. Note that EDITOR and COVERART don't need page numbers, so shouldn't be included. And this doesn't actually include MAGAZINES with NO contents:
select count(*) from pubs p where not exists (select 1 from pub_content pc where p.pub_id = pc.pub_id) and p.pub_ctype = 'MAGAZINE'
But there's only 5 of those. As many problems as we have with NOVEL, NONFICTION, and COLLECTION(!) entries in magazines. BLongley 14:58, 3 Dec 2007 (CST)
A couple of other quick Entropy Measurements, useful to watch over time maybe.
Pubs without pages, that aren't audio-books (which shouldn't have them) or electronic (which may or may not):
select case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end, count(*) from pubs p WHERE p.pub_ptype NOT LIKE '%audio%' And p.pub_ptype NOT LIKE '%cassette%' And p.pub_ptype NOT LIKE 'CD%' And p.pub_ptype NOT LIKE 'compact disc%' And p.pub_ptype NOT LIKE 'e%book%' And p.pub_ptype NOT LIKE 'electron%' And p.pub_ptype NOT LIKE '%web%' And p.pub_ptype NOT LIKE '%ezine%' And p.pub_ptype NOT LIKE '%internet%' And p.pub_ptype NOT LIKE '%mp3%' And p.pub_ptype NOT LIKE '%Adobe%' And p.pub_ptype NOT LIKE '%Mobipocket%' And p.pub_ptype NOT LIKE '%PDF%' And p.pub_ptype NOT LIKE '%tape%' And p.pub_ptype NOT LIKE '%www%' And p.pub_ptype NOT LIKE '%digit%' And p.pub_ptype NOT LIKE '%online%' group by case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end
Bad 19535 Good 82131
Pubs without Prices: I've been using "L" on its own to force a link to Amazon UK at times, but if it's ONLY "L" then it still hasn't got a proper price.
select case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end, count(*) from pubs p group by case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end
Bad 13118 Good 96196
OK, off to download a later backup and see if things have got better or worse... BLongley 13:22, 9 Dec 2007 (CST)
Reviews of things that don't exist
Due to "Entropy Measurements" again, I had a look at how we represent reviews. How do we get a review to refer to the reviewed title, reviewed author, and still fit a reviewer in? Well, this is one of the complicated bits. We sneak TWO entries into "canonical_author". Possibly the most misleading table name in the database: it links titles and authors, but different kinds of authors, so occasionally we need multiple entries. It doesn't just affect reviews either, it covers interviews. When looking at "canonical_author" there's a "ca_status" column that helps sort them out. I think it goes like this (for Review and Interview title types):
ca_status = 1 -- Reviewer/Interviewer ca_status = 2 -- Interviewee ca_status = 3 -- Author of reviewed title
So we can enter a pair of entries, for reviewer and reviewee, or interviewer and interviewee, and it all works somehow. (No idea what happens if all three are present.)
So I guess we can do something like this for reviews of missing things:
select t.title_title, a.author_canonical from titles t, canonical_author ca, authors a where ca.title_id = t.title_id and a.author_id = ca.author_id and t.title_ttype = 'REVIEW' and ca.ca_status = 3 and not exists (select 1 from pubs p where p.pub_title = t.title_title) LIMIT 20
Notice the LIMIT, as these queries are dead slow if you get the "NOT EXISTS" bit wrong. I think the ISFDB software is cleverer for matching things that DO exist, and matches title and author against titles and their authors rather than against pubs. But when something is missing, it might be good to check against everywhere they MIGHT be. Still, as this can only be run against downloads of the database, maybe it's time to look at how we can locally add indexes and suchlike performance improvements for the queries that check BAD things rather than just GOOD things.
(Database experts, feel free to butt in here. I'd rather get a definitive answer on the design over figuring it out myself from the data.) BLongley 16:13, 7 Dec 2007 (CST)
Playing with TOAD 3.1
Well, it still kills my machine if I run an inefficient query, but the Graphical Query Builder is looking quite useful (rather like the one in MS-Access). As usual though, the sort of SQL it generates isn't what I'd necessarily write:
SELECT `titles`.title_ttype, `authors`.author_canonical, `titles`.title_title, MAX(`verification`.user_id) , MAX(`verification`.reference_id) FROM ( ( ( ( isfdb.titles `titles` INNER JOIN isfdb.canonical_author `canonical_author` ON (`titles`.title_id = `canonical_author`.title_id)) INNER JOIN isfdb.authors `authors` ON (`canonical_author`.author_id = `authors`.author_id)) INNER JOIN isfdb.pub_content `pub_content` ON (`pub_content`.title_id = `titles`.title_id)) INNER JOIN isfdb.pubs `pubs` ON (`pub_content`.pub_id = `pubs`.pub_id)) LEFT OUTER JOIN isfdb.verification `verification` ON (`verification`.pub_id = `pubs`.pub_id) WHERE (`authors`.author_canonical = 'Philip K. Dick') AND (`titles`.title_ttype IN ('NOVEL', 'ANTHOLOGY', 'COLLECTION')) AND (`verification`.reference_id = 1 OR `verification`.reference_id IS NULL) AND (`verification`.user_id IN (2781, 4121) OR `verification`.user_id IS NULL) GROUP BY `titles`.title_ttype, `authors`.author_canonical, `titles`.title_title
Still, that's a perfectly useful query for finding books by an author and showing whether I've verified them or not. Now if only I'd verified all the pubs I own before other people got to them... :-/ BLongley 12:55, 9 Dec 2007 (CST)
Authors that only exist because of reviews
They are a pain as they show up in the Author Directory with no useful information. This script should find the stray authors, along with the title IDs for the reviews that create them. (Bung the title ID into a "http://www.isfdb.org/cgi-bin/title.cgi?XXXXXX" query directly.)
select a.author_canonical, ca.title_id FROM canonical_author ca, authors a WHERE ca.ca_status = 3 and ca.author_id = a.author_id AND NOT EXISTS (SELECT 1 from canonical_author ca2, titles t where ca.author_id = ca2.author_id AND ca2.title_id = t.title_id AND t.title_ttype != 'REVIEW' and ca2.ca_status = 1) order by 1,2
According to my last loaded backup, there's about 1200 authors that may need the review correcting, or a reviewed title to be created. I'll update when I've loaded the latest. BLongley 09:06, 29 Dec 2007 (CST)
- Down to 994 now - must have been a good month of cleanups! BLongley 09:23, 29 Dec 2007 (CST)