Difference between revisions of "Talk:Database Schema"

Revision as of 00:12, 18 March 2008

Examples of pub_content usage

A simple NOVEL publication has a NOVEL content entry. When you display the publication, there's no need to show the contents, as a Novel obviously contains a Novel. When you edit the pub, though, the NOVEL content record is shown, as you might want to put a page number on it, for instance. Here are the relevant bits of the relevant records; we have one entry in each table.

pubs                        pub_content                     titles
pub_id: 183625              pubc_id: 739717                 title_id: 459
pub_ctype: NOVEL            pub_id: 183625                  title_ttype: NOVEL
pub_tag: DTHBSTVXTX1978     title_id: 459                   title_title: Deathbeast
pub_title: Deathbeast

Of course, a NOVEL Title doesn't have to be in a NOVEL publication. What if it's part of an Omnibus? You obviously want a link from the NOVEL title to the OMNIBUS publication. And a link for each other NOVEL to the same OMNIBUS publication. But an Omnibus isn't just the sum of its contents, so there's an OMNIBUS title record as well in the same pub. Again, this doesn't need to be shown when you display the publication, but we DO want to display the NOVEL contents. This time though, the OMNIBUS record is hidden when you edit the pub - there's nothing you really need to do to it. (Well, you could set a length for it, which is the way we record how many entries, or which entries from a series, make up the omnibus. But we do that on the Title itself instead.)

pubs                        pub_content                     titles
pub_id: 174441              pubc_id: 673891                 title_id: 503331
pub_ctype: OMNIBUS          pub_id: 174441                  title_ttype: OMNIBUS
pub_tag: DDFNNLKHNL2000     title_id: 503331                title_title: Dead Funny
pub_title: Dead Funny

                            pubc_id: 673901                 title_id: 1926
                            pub_id: 174441                  title_ttype: NOVEL
                            title_id: 1926                  title_title: Flying Dutch

                            pubc_id: 673911                 title_id: 8769
                            pub_id: 174441                  title_ttype: NOVEL
                            title_id: 8769                  title_title: Faust Among Equals

There's actually another content record for this pub, a COVERART record. This stays hidden when you edit the pub, because it's maintained automatically when the Artist field is used.

                            pubc_id: 673881                 title_id: 503321
                            pub_id: 174441                  title_ttype: COVERART
                            title_id: 503321                title_title: Cover: Dead Funny
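To see how the three tables hang together, here is a minimal sketch that loads the "Dead Funny" rows shown above into a throwaway SQLite database (via Python's sqlite3 module) and lists the contents of the pub. It is only an illustration: the real tables have many more columns, and the real database is MySQL, but the SELECT itself should run there unchanged apart from the `?` placeholder.

```python
import sqlite3

# In-memory stand-in for the three tables; only the columns shown above are modelled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_ctype TEXT, pub_tag TEXT, pub_title TEXT);
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_ttype TEXT, title_title TEXT);
CREATE TABLE pub_content (pubc_id INTEGER PRIMARY KEY, pub_id INTEGER, title_id INTEGER);

INSERT INTO pubs VALUES (174441, 'OMNIBUS', 'DDFNNLKHNL2000', 'Dead Funny');
INSERT INTO titles VALUES (503331, 'OMNIBUS', 'Dead Funny');
INSERT INTO titles VALUES (1926, 'NOVEL', 'Flying Dutch');
INSERT INTO titles VALUES (8769, 'NOVEL', 'Faust Among Equals');
INSERT INTO titles VALUES (503321, 'COVERART', 'Cover: Dead Funny');
INSERT INTO pub_content VALUES (673881, 174441, 503321);
INSERT INTO pub_content VALUES (673891, 174441, 503331);
INSERT INTO pub_content VALUES (673901, 174441, 1926);
INSERT INTO pub_content VALUES (673911, 174441, 8769);
""")

# Every pub_content row links one pub to one title, so a join on title_id
# lists everything "inside" the pub, including the hidden linking records.
CONTENTS_SQL = """
SELECT pc.pubc_id, t.title_ttype, t.title_title
FROM pub_content pc
JOIN titles t ON t.title_id = pc.title_id
WHERE pc.pub_id = ?
ORDER BY pc.pubc_id
"""

contents = conn.execute(CONTENTS_SQL, (174441,)).fetchall()
for row in contents:
    print(row)
```

The OMNIBUS and COVERART rows come back alongside the two NOVELs: the database makes no distinction between the "Special" linking records and ordinary contents; it's the editing screens that hide them.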

COLLECTION and ANTHOLOGY pubs (and even NOVELs) also have content title types such as SHORTFICTION, ESSAY, INTERIORART and POEM that show up on the pub display and while editing. They're pretty harmless so long as you understand that you're not working on the entry in this pub alone: you're working on ALL entries of that title in ANY publication. INTERVIEW and REVIEW types have their own sections while editing, but you may need to change one of the other content types to one of these if it was mis-entered at first. Those require an extra field to be entered though (the Author moves to Interviewer/Reviewer but you still have to add Book Author/Interviewee), so you'll want to do two passes of edits.

It's the Special record used to link the title and pub that is dangerous to change: if one of these becomes visible while editing, try and keep the pub type and entry type matched: NOVEL to NOVEL, COLLECTION to COLLECTION, ANTHOLOGY to ANTHOLOGY, OMNIBUS to OMNIBUS. Sometimes when correcting a publication type, one of these special records isn't hidden away any more and looks as though it could be usefully changed to a normal content record: e.g. if a NOVEL called "XYZ" is actually a collection with a Short Story called "XYZ" in it, it may be tempting to change the XYZ NOVEL record to the SHORTFICTION entry that's required. DON'T! The NOVEL record should change to COLLECTION to keep the link working, and a SHORTFICTION for "XYZ" the short story is added separately.

The problems that start when you lose the link record that matches a pub and title may not be obvious immediately, but if you find pubs where the hyper-link to the title record isn't present, or get warnings about title record mismatches, or you get warnings when Cloning or trying to Remove Titles from a pub, stop and check. Ask for help if needed.

NONGENRE and NONFICTION look as though they might be used for normal contents, but they're not intended that way: they're intended to be the ONE main record for a publication that's of little interest to us as it's not SF - this is why we have troubles if we try and enter Non-Genre Short-fiction in a Non-Genre Publication for instance. Interesting Non-Fiction pieces in a publication should be of type ESSAY - one big NONFICTION record would be used to class the whole pub as NONFICTION, and not worth going into a lot of detail about in ISFDB.

SERIAL shouldn't give you a problem if you're working on books, it's intended for magazines (although it has been used within books to group episodes of a story that aren't consecutive in the book). But if you're working on magazines, then you will come across EDITOR and COVERART too - but I'll leave explaining those to a Magazine mod.

I hope this helps! BLongley 09:37, 21 Oct 2007 (CDT)

Things to do with pub_content in MySQL

As you can see from above, there may be a number of content records for even the simplest pub, so let's start with a simple case: NO content records for a title. Here's some SQL to find them:

select t.title_id, t.title_title from titles t 
where not exists (select 1 from pub_content pc 
                  where pc.title_id = t.title_id)
and t.title_parent = 0
LIMIT 100

You'll want that "LIMIT" on the end as there are quite a few of them.
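If NOT EXISTS looks unfamiliar, the same question can be asked with a LEFT JOIN that keeps only the titles for which no pub_content row was found. The sketch below runs both forms against a tiny in-memory SQLite database so you can see they agree; the two "Made-Up" titles and their IDs are invented for the illustration, and only the Deathbeast row comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_title TEXT, title_parent INTEGER);
CREATE TABLE pub_content (pubc_id INTEGER PRIMARY KEY, pub_id INTEGER, title_id INTEGER);
INSERT INTO titles VALUES (459, 'Deathbeast', 0);
-- The next two rows are invented test data, not real ISFDB records.
INSERT INTO titles VALUES (900001, 'Made-Up Title Without Pubs', 0);
INSERT INTO titles VALUES (900002, 'Made-Up Variant Without Pubs', 459);
INSERT INTO pub_content VALUES (739717, 183625, 459);
""")

# The form used in the text: keep titles for which no content row exists.
NOT_EXISTS_SQL = """
SELECT t.title_id, t.title_title FROM titles t
WHERE NOT EXISTS (SELECT 1 FROM pub_content pc
                  WHERE pc.title_id = t.title_id)
AND t.title_parent = 0
ORDER BY t.title_id
LIMIT 100
"""

# Equivalent form: outer-join to pub_content and keep the unmatched titles.
LEFT_JOIN_SQL = """
SELECT t.title_id, t.title_title FROM titles t
LEFT JOIN pub_content pc ON pc.title_id = t.title_id
WHERE pc.pubc_id IS NULL
AND t.title_parent = 0
ORDER BY t.title_id
LIMIT 100
"""

not_exists_rows = conn.execute(NOT_EXISTS_SQL).fetchall()
left_join_rows = conn.execute(LEFT_JOIN_SQL).fetchall()
print(not_exists_rows)
print(left_join_rows)
```

Which form is faster on the real data will depend on the MySQL version and indexes, so try both before settling on one.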

So what's wrong with these?

There can be titles we are sure exist, but we don't have details of any publication yet: you could go search one out at an internet bookshop and create a publication, if you like. First editions are best as they tend to confirm the date on the title.
Maybe the title record shouldn't be there: if someone deleted some pub that turned out to be an RPG dice set, the title wouldn't go automatically, it has to be deleted in a second submission. Maybe someone forgot that second submission?

Simple, eh? Well, I kept it simple deliberately, by restricting it to non-variant titles. Let's take out the "t.title_parent = 0":

select t.title_id, t.title_title, t.title_parent from titles t 
where not exists (select 1 from pub_content pc 
                  where pc.title_id = t.title_id)
LIMIT 100

Look at the title_parent: if it's not zero then this is a Variant title. (This is why the way to remove an incorrect variant link is to set the parent to zero, rather than blank.) In which case:

The "missing" publication(s) might be stored under the parent instead, and need to be moved back, or
We're just missing the publication(s) for THIS variation of the title (feel free to go find one to enter), or
There never WILL be an entry under this variation.

Why is the last OK? Well, consider Iain M. Banks and Iain Banks. They're the same person, using the world's least-confusing pseudonym. Any given book of his is published under one name or the other, depending on whether it's SF or not. However, we keep ONE canonical author record for him, and so we need all his books to have an entry for "Iain M. Banks", even when that book will never be published under that name, only as by "Iain Banks". So one variant will have pubs, the other won't: it's just to put all the titles under one name, and the software will figure out the "as by" for us. This is why the simple case above is actually less simple than it looked: again, it is fine for a parent title to have no publications if all of them go under a variant title of the one you're looking at.
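One way to separate the harmless empty parents from the genuinely publess titles is to add a column saying whether any variant of the title has a publication. This is a sketch only, run against an in-memory SQLite database with invented rows ("Made-Up Book A" and "B" and all the IDs are not real records); it looks one level down, so variants of variants are not followed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_title TEXT, title_parent INTEGER);
CREATE TABLE pub_content (pubc_id INTEGER PRIMARY KEY, pub_id INTEGER, title_id INTEGER);
-- Invented rows: a parent with no pubs, its variant (which has one), and a true stray.
INSERT INTO titles VALUES (900001, 'Made-Up Book A', 0);
INSERT INTO titles VALUES (900002, 'Made-Up Book A', 900001);
INSERT INTO titles VALUES (900003, 'Made-Up Book B', 0);
INSERT INTO pub_content VALUES (1, 800001, 900002);
""")

# For each parent title with no pubs of its own, report 1 if at least one
# direct variant has a pub_content row, otherwise 0.
PUBLESS_SQL = """
SELECT t.title_id, t.title_title,
       EXISTS (SELECT 1 FROM titles v
               JOIN pub_content vpc ON vpc.title_id = v.title_id
               WHERE v.title_parent = t.title_id) AS variant_has_pubs
FROM titles t
WHERE t.title_parent = 0
AND NOT EXISTS (SELECT 1 FROM pub_content pc
                WHERE pc.title_id = t.title_id)
ORDER BY t.title_id
"""

publess = conn.execute(PUBLESS_SQL).fetchall()
for row in publess:
    print(row)
```

Rows with a 1 in the last column are the "Iain Banks" kind and can probably be left alone; rows with a 0 are the ones worth investigating.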

If this is confusing you, just try it out. You can search for the t.title_title in ISFDB itself, with the usual title search: if you want to search by the t.title_parent, then you can put that number into a URL and search that way, e.g.

http://www.isfdb.org/cgi-bin/title.cgi?887

or use some more SQL:

select * from titles t where t.title_id = 887

There's one other over-simplification I've made here you may have already noticed. What happens when there's a variant title of a variant title? Well, that's when people start getting headaches, so that's worth saving for a different topic.

Let's try it from the other side

How about, instead of titles with no contents, pubs with no contents?

select p.pub_id, p.pub_tag, p.pub_title, p.pub_ctype from pubs p 
where not exists (select 1 from pub_content pc 
                  where pc.pub_id = p.pub_id)
LIMIT 100

(Again, you may want a LIMIT on this, there's a few hundred.)

If you plug the resulting pub_title information into a publication search (last of the Advanced search options), you may well think everything looks OK, it finds the publication alright. You can even click on the Title and find a fairly normal-looking display. (Can you spot the missing bit though?) Try it in the Advanced or Normal title searches and you won't necessarily find ANY records though. Search for that author and you won't necessarily find that title. If you do, you won't find that pub under it.

What's happening here then? Well, without a pub_content record linking the pub to the title, you don't get the title link appearing when you look at the pub. The Author's titles still show up, but this publication won't appear under them. How to fix? Simple! Once you've found the publication (remember, only the advanced pub search will work) you'll either find the orphan record, or the orphan record and a good record to go with it. For instance, at the moment I can see

pub_id	pub_tag	        pub_title	                pub_ctype	
5643	BBBLSNDTHSA2004	Bubbles and the Secret Admirer	NOVEL

and searching for "Bubbles and the Secret Admirer" shows:

Record 	Title 	Authors 	Year 	Pages 	Tag 	Binding 	Type 	ISBN 	Price 	Publisher
5642 	Bubbles and the Secret Admirer 	E. S. Mooney 	2004 	48 	BBBLSNDTHS2004 	tp 	NOVEL 	0-439-49177-0 	$3.99 	Scholastic
5643 	Bubbles and the Secret Admirer 	E. S. Mooney 	2004 	48 	BBBLSNDTHSA2004 	tp 	NOVEL 	0-439-49177-0 	$3.99 	Scholastic

Record 5643 can be deleted as record 5642 is the good version.

However, I also see:

pub_id	pub_tag	        pub_title	pub_ctype	
5641	BRTRBTSTP2004	Brute Orbits Tp	NOVEL

And searching for "Brute Orbits" (as I guessed the "Tp" was an error) finds:

Record 	Title 	Authors 	Year 	Pages 	Tag 	Binding 	Type 	ISBN 	Price 	Publisher
5639 	Brute Orbits 	George Zebrowski 	1998 	222 	BRUTEORBIT1998 	hc 	NOVEL 	0-06-105026-1 	$23.00 	HarperPrism
5640 	Brute Orbits 	George Zebrowski 	1999 	? 	BRTRBTSTP1999 	tp 	NOVEL 	0-06-105380-5 	$15.00 	Eos
225585 	Brute Orbits 	George Zebrowski 	1999 	336 	BRTRBTSTJD1999 	pb 	NOVEL 	0-06-105807-6 	$6.99 	 HarperPrism
5641 	Brute Orbits Tp 	George Zebrowski 	2004 	? 	BRTRBTSTP2004 	tp 	NOVEL 	0-06-105380-5 	$15.00 	Eos

We don't have a 2004 trade paperback edition, so we probably want to rescue this 5641 stray. You can get to edit this publication directly via the "record" link, or via the title link and "Edit this pub". Remember all the "Special" contents records I told you not to mess with? Well, this is a time where we DO want to mess with it, as it's broken. As there's no content at all, the software invites you to enter an ANTHOLOGY - we don't necessarily want that though, we want what the publication says. (Most seem to be NOVELs though.) Fortunately there's enough info there - enter a Content record with the right title, author, type and date and submit. Pick the date from the earliest GOOD record, or from the pub, it won't matter - we're going to have to merge titles after approval anyway. You should probably make sure the title matches up, so that it's easy to search for the titles to BE merged.

You may find a complete stray, with no other good record equivalent: you can fix it this way so other editors will come across it naturally, or research the pub and delete it if it's completely wrong.

How does this happen? I don't know. If you find new examples being created, let us know. I personally suspect that the ISFDB-1 to ISFDB-2 conversion created such entries (as there's often a good pub entry with a pub_id 1 less than the pub_id of the bad pub entry), in which case this article will become obsolete when we've fixed them all. BLongley 15:08, 21 Oct 2007 (CDT)
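The "pub_id one less" hunch can be checked with a self-join: pair each content-less pub with the pub whose pub_id is exactly one lower, and keep the pairs where that neighbour does have contents. The sketch below tries it on an in-memory SQLite database holding the four pubs from the examples above. The pub_content rows (and their pubc_id and title_id values) are invented, on the assumption that 5640 and 5642 are the good records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_tag TEXT, pub_title TEXT, pub_ctype TEXT);
CREATE TABLE pub_content (pubc_id INTEGER PRIMARY KEY, pub_id INTEGER, title_id INTEGER);
INSERT INTO pubs VALUES (5640, 'BRTRBTSTP1999', 'Brute Orbits', 'NOVEL');
INSERT INTO pubs VALUES (5641, 'BRTRBTSTP2004', 'Brute Orbits Tp', 'NOVEL');
INSERT INTO pubs VALUES (5642, 'BBBLSNDTHS2004', 'Bubbles and the Secret Admirer', 'NOVEL');
INSERT INTO pubs VALUES (5643, 'BBBLSNDTHSA2004', 'Bubbles and the Secret Admirer', 'NOVEL');
-- Invented linking rows: only the two "good" pubs get contents.
INSERT INTO pub_content VALUES (1, 5640, 900001);
INSERT INTO pub_content VALUES (2, 5642, 900002);
""")

# bad = a pub with no contents; good = the pub numbered one lower, with contents.
NEIGHBOUR_SQL = """
SELECT bad.pub_id, bad.pub_title, good.pub_id, good.pub_title
FROM pubs bad
JOIN pubs good ON good.pub_id = bad.pub_id - 1
WHERE NOT EXISTS (SELECT 1 FROM pub_content pc WHERE pc.pub_id = bad.pub_id)
AND EXISTS (SELECT 1 FROM pub_content pc WHERE pc.pub_id = good.pub_id)
ORDER BY bad.pub_id
"""

pairs = conn.execute(NEIGHBOUR_SQL).fetchall()
for row in pairs:
    print(row)
```

Comparing the two titles in each pair is a quick way to tell a likely duplicate (same title) from a stray that may deserve rescuing (title differs, as with "Brute Orbits Tp").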

From too few to too many

OK, we've looked at pubs with not enough (i.e. zero) contents, and titles with not enough (i.e. zero) contents. What about when there are too MANY contents? I'm not talking about books with a lot of SHORTFICTION or ESSAYs in them, those are common: they're probably an ANTHOLOGY or COLLECTION. And a book about or by an artist may have a lot of INTERIORART. It's the Special, linking, content records that might be a worry. I mentioned that a simple case is a NOVEL containing a NOVEL, and we all know that a book containing two or more NOVELs is called an OMNIBUS, don't we? So let's go find some Omnibuses that haven't been called that.

select p.pub_id, p.pub_tag, p.pub_title, p.pub_ctype 
, COUNT(*)
from pubs p, titles t, pub_content pc
where pc.pub_id = p.pub_id
and pc.title_id = t.title_id
and t.title_ttype = 'NOVEL'
and p.pub_ctype != 'OMNIBUS'
GROUP BY p.pub_id,p.pub_tag, p.pub_title, p.pub_ctype 
HAVING COUNT(*) > 1
LIMIT 100

Oh dear - it seems that sometimes we don't call a book with two NOVELs an OMNIBUS, we might call it an "Ace Double" instead:

pub_id	pub_tag	        pub_title	                        pub_ctype	COUNT(*)	
1591	CRSSTNVFE1958	Across Time / Invaders From Earth	NOVEL	        2	
1769	GNSTRCTTTH1972	Against Arcturus / Time Thieves	        NOVEL	        2	
1998	LNFMRTMC1956	Alien From Arcturus / Atom Curtain	NOVEL	        2

One for a Standards discussion maybe. I don't own many, and none of those, so I'm opting out of that for now. I can't ignore these though: record 4545 comes up:

pub_id	pub_tag	        pub_title	pub_ctype	COUNT(*)	
4545	BIPOHL1982	BiPohl	        NOVEL	        2

It's a title I might have considered buying, and Advanced searching for "BiPohl" publications shows these:

Record 	Title 	Authors 	Year 	Pages 	Tag 	Binding 	Type 	ISBN 	Price 	Publisher
4545 	BiPohl 	Frederik Pohl 	1982 	313 	BIPOHL1982 	pb 	NOVEL 	0-345-30247-8 	$2.75 	Ballantine Del Rey
4546 	BiPohl 	Frederik Pohl 	1987 	? 	BPHL1987 	tp 	OMNIBUS 	0-345-35005-7 	$3.50 	Ballantine
136061 	BiPohl 	Frederik Pohl 	1982 	314 	BPHLLHKQVN1982 	pb 	OMNIBUS 	0-345-30247-8 	$2.75 	Ballantine

It turns out 136061 has been verified, and contains two novels I already own - just the sort of information I wanted! Saves me buying a book I don't need. Looking at 4545 makes me think it's a Duplicate, and can be deleted.

Here's another book I own a version of:

pub_id	pub_tag	        pub_title	pub_ctype	COUNT(*)	
13811	FRSTLNSMN1982	First Lensman	NOVEL	        2

Oh dear, when I try and edit that I see it's got TWO identical NOVEL entries. Never mind, I can "Remove Titles From This Pub", can't I? No. ISFDB hides the "correct entry" for me: unfortunately both are correct, we just don't need TWO of them.

Blank out one of them? Doesn't work.

How about if I change the title type to something different like ANTHOLOGY? Well, that shows up the two NOVELS for removal in "Remove Titles From This Pub", let's try removing one. There's a big yellow "WARNING: Unable to locate the title reference for this publication" message though... oh well, it's a screwed-up pub anyway, let's try it - Wey-hey, it worked! OK, now I need to change the title back to NOVEL.

I'll do a few more whilst I'm at it.

pub_id	pub_tag	        pub_title	pub_ctype	COUNT(*)	
22863	MRTHNHMN601953	More Than Human	NOVEL	        2	
22864	MRTHNHMN1953	More Than Human	NOVEL	        2

Oh no! I suggested we remove ONE title, it says I want to remove BOTH! REJECT, REJECT, REJECT! Or maybe not. So long as I put the title BACK afterwards it's OK. Phew, that worked. Oh wait, it's considered a new title - merge them both and we're back to normal. This is definitely NOT for the faint-hearted! Far better to leave it alone and call for help, I think. And it's arguable about whether it's an anthology anyway, it seems. Maybe it's a novel with some extra short-fiction and essays? TWO copies of the same novel in one pub was obviously wrong, but unless I've got all the data about that novel saved in case I need to put it back, I'm not going to touch such again for a while. So much more to learn...
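The omnibus query can't tell "two different NOVELs in one pub" from "the same NOVEL linked twice", and as the examples above show, the two need very different handling. Grouping pub_content by both pub_id and title_id picks out just the doubled links. A minimal sketch on an in-memory SQLite database follows; the title_id 900001 used for the doubled entry is invented, and the Dead Funny rows are there as a control that should NOT be reported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pub_content (pubc_id INTEGER PRIMARY KEY, pub_id INTEGER, title_id INTEGER);
-- Invented: pub 13811 linked twice to the same (made-up) title_id.
INSERT INTO pub_content VALUES (1, 13811, 900001);
INSERT INTO pub_content VALUES (2, 13811, 900001);
-- Control: an omnibus with two different novels, each linked once.
INSERT INTO pub_content VALUES (673901, 174441, 1926);
INSERT INTO pub_content VALUES (673911, 174441, 8769);
""")

# A (pub_id, title_id) pair should normally appear once; more than once
# means the same title is linked to the same pub repeatedly.
DOUBLED_SQL = """
SELECT pc.pub_id, pc.title_id, COUNT(*)
FROM pub_content pc
GROUP BY pc.pub_id, pc.title_id
HAVING COUNT(*) > 1
ORDER BY pc.pub_id
"""

doubled = conn.execute(DOUBLED_SQL).fetchall()
print(doubled)
```

Knowing in advance that a pub is of the doubled kind tells you that "Remove Titles From This Pub" is likely to take BOTH entries, so you can note down the title details before you start.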

Back to a simple table

The last example caused too much thinking, so let's go look at another simple table that may be of use: Verification. Hopefully if you've got to this stage, you've been verifying the publications you actually own. Primary Verification where you've still got the book is reference_id 1. If you want to mourn books you didn't keep, the "Primary (Transient)" is reference_id 12. Note that this does NOT match the reference_id in the reference table!

Here's how I find what I verified:

select p.pub_id, p.pub_title, v.ver_time
from pubs p, verification v
where v.pub_id = p.pub_id
and v.reference_id = 1
and v.user_id = 2781
ORDER BY 2

That's alphabetical by title; you may want to ORDER BY 3 instead if you're trying to find all the early verifications you did that you think aren't quite good enough by your current standards. ;-)

Of course, that's just me bragging about MY verifications as it stands - you need to put your own user_id in, I'm 2781 and you're NOT. YOU are... well, we can't find that out from SQL, the backups hide all the personally identifiable information. (Phew!) You'll need to go look at your own Wiki preferences and read the first line, "Your internal ID number is XXXX". I bet you haven't looked at THAT page for a while, maybe there's something else you'd like to update while you're there?

"Bill, you haven't put the author in there, how do you tell your Starburst by Pohl from your Starburst by Bester?" Well, of course, I've memorised the pub_id for everything I still own... ;-) Nah, just kidding. The reason is that a lot of my books aren't by a single author, or are by a variant author, and tonight I just want a simple example. It'll get complex enough soon enough, you can be sure of that. Just try the simple things for now. BLongley 16:03, 23 Oct 2007 (CDT)

Why the Publisher Search doesn't work

You may, for instance, have decided to have a look for all the Ace Doubles I pointed out earlier. And the Advanced Search does offer you the opportunity of searching for publications where Term 1 is 'Ace Double' and the search-type is 'Publisher'. This gives a nasty-looking set of Python error messages - don't worry, you haven't broken anything! Some of the error says:

select pubs.* from pubs where pubs.pub_publisher... '%Ace Double%' order by pubs.pub_title limit 100"

- this is a shortened version of the SQL it's trying - and the explanation is

"Unknown column 'pubs.pub_publisher' in 'where clause'"

Pretty self-explanatory really, as there is indeed NO pubs.pub_publisher column in the pubs table. So what SQL should be executed? Well, the publishers seem to have been normalised into their own table, publishers, which is quite a simple one:

desc publishers;

Field          Type            Null    Key
publisher_id   int(11)         NO      PRI
publisher_name varchar(64)     YES		
note_id        int(11)         YES

We can search for 'Ace Double' in here:

select * from publishers 
where publisher_name like '%Ace Double%'

publisher_id   publisher_name   note_id	
381            Ace Double

and if you look at the pubs table, sure enough there's a publisher_id column there we can search by:

select * from pubs p 
where p.publisher_id = 381
LIMIT 100

Bingo! Lots and lots of Ace Doubles. Of course, you can find most of them just by searching for titles of ' / ' in ISFDB itself, but SQL can be a bit more generic, or more specific - ' / ' finds some NON-Ace Doubles for instance. Combine the two queries and use something like this to search for publications by the Publisher of your choice: replace "Bill" with the publisher you want.

select pu.publisher_name, p.* 
from pubs p, publishers pu 
where p.publisher_id = pu.publisher_id 
and pu.publisher_name like '%Bill%'
order by p.pub_title
LIMIT 100

This seems to be a good example of what the database USED to be like - where there once was a pubs.pub_publisher column? - and what it may well become: that publishers.note_id column hasn't been used yet.

"And Not an Omnibus...."

There's another Python problem if you try that in Advanced Search. Say you've tackled all the Ace Doubles, and now want to look at everything else that has ' / ' in the title, but ISN'T an Omnibus. Well, it's easy to search for ' / ' and Ttype of 'NOVEL' for instance, although the search won't recognise the spaces around the slash. But trying to search for ' / ' and NOT Ttype of 'OMNIBUS' doesn't work:

ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the
right syntax to use near 'ANDNOT titles.title_ttype='OMNIBUS' order by titles.title_title limit 100' at line 1")
args = (1064, "You have an error in your SQL syntax; check the ... order by titles.title_title limit 100' at line 1")

That seems to suggest that ISFDB is using "ANDNOT" rather than "AND NOT". Dead easy to correct if we run the SQL ourselves though:

select t.* from titles t
where t.title_title like '% / %'
AND NOT t.title_ttype = 'OMNIBUS'

OK, that shows there are plenty of other types to exclude. You may want to try something like

select t.* from titles t
where t.title_title like '% / %'
and not t.title_ttype = 'OMNIBUS'
and not t.title_ttype = 'COVERART'
and not t.title_ttype = 'INTERIORART'
and not t.title_ttype = 'ESSAY'
and not t.title_ttype = 'POEM'
and not t.title_ttype = 'EDITOR'

Or if we remember "!=" is slightly less typing:

select t.* from titles t
where t.title_title like '% / %'
and t.title_ttype != 'OMNIBUS'
and t.title_ttype != 'COVERART'
and t.title_ttype != 'INTERIORART'
and t.title_ttype != 'ESSAY'
and t.title_ttype != 'POEM'
and t.title_ttype != 'EDITOR'

Or even better, remember "NOT IN" and use:

select t.* from titles t
where t.title_title like '% / %'
and t.title_ttype NOT IN ('OMNIBUS', 'COVERART', 'INTERIORART', 'ESSAY', 'POEM', 'EDITOR')

You could also use "IN" rather than "NOT IN" and list the other types instead - title_ttype is fixed to certain values, namely 'ANTHOLOGY', 'BACKCOVERART', 'COLLECTION', 'COVERART', 'INTERIORART', 'EDITOR', 'ESSAY', 'INTERVIEW', 'NOVEL', 'NONFICTION', 'NONGENRE', 'OMNIBUS', 'POEM', 'REVIEW', 'SERIAL', 'SHORTFICTION', 'CHAPTERBOOK'.

Lots of options, but all quite simple, and all working around a simple problem.

Are you LOOKING for trouble?

Hopefully by now you can spot what this SQL is trying to do.

select p.pub_title, pc.pub_id, t.title_ttype, count(*)
from pub_content pc, titles t, pubs p
where pc.title_id = t.title_id
and pc.pub_id = p.pub_id
and t.title_ttype IN ('ANTHOLOGY', 'COLLECTION')
and p.pub_ctype != 'OMNIBUS'
group by p.pub_title, pc.pub_id, t.title_ttype
having count(*) > 1

It's OK to have an Anthology or a Collection as part of a bigger book, if the book is an Omnibus. You'd have contents for the Antholog(y/ies) and Collection(s) AND contents for all the shortfiction and essays and such too. That's fine. But we don't want collections of collections or anthologies of anthologies. And some of these problem publications have the SAME title in them twice (who ignored the big Yellow Warning messages about merging, eh?).

But these are not simply classifiable from the above information alone, you're going to have to investigate, and make some pretty dangerous edits in the meantime too.

DO NOT TRY THIS IF YOU DO NOT KNOW HOW TO PUT THINGS BACK TO AT LEAST NO MORE THAN THE SAME SORT OF PROBLEM AS THEY HAD BEFORE!

OK, now I've warned you, let's have a look at a few:

A Star Above and Other Stories	1190	COLLECTION	2

Two collections within one collection? Not that ISFDB will show you that. Changing the Pub type to NOVEL showed up the two collections though. One (with some Internet research despite the NESFA site being unavailable) seems clearly bogus, so "Remove Titles from this pub" got rid of that one. The other is right(ish) - it's "A Star Above It and Other Stories". So that got retitled, and although I don't want to put all the contents in (I don't own it, can't "Look Inside" on Amazon or suchlike, and it postdates the catalogue at tamu.edu) I've at least left the suspected contents in notes for someone to sort out later. It IS a collection though, so the pub type got changed back.

Anasazi	2564	COLLECTION	2

Changing the Pub type to NOVEL showed up the two collections again. Unfortunately they were both the same one, so when I "Removed Titles from this pub" BOTH went, and I had to put one back. Fortunately Contento had details of the contents of this edition, so they now have page numbers. Unfortunately there is a second printing too, so I needed to merge the Collection titles again afterwards. (Yes, changing to NOVEL is usually a temporary measure only, this needed to go back to "COLLECTION" status.)

Bending the Landscape: Science Fiction	3999	ANTHOLOGY	2

This was a simpler one. Changing the Pub type to NOVEL showed up the two Anthologies, and one was called an Introduction - clearly an 'ESSAY' mis-classified. So I fixed that entry, changed the pub type back to 'ANTHOLOGY' and it seems OK now.

Note that the usual resolution is to change the publication back TO the type you changed it FROM. If you've ignored the Big Warnings and got out of your depth, this is the usual thing to try even if you haven't fixed anything.

DO please ask for help if you think you've made things WORSE.

Best from Fantasy and Science Fiction: 14th Series	4076	ANTHOLOGY	2	
Best from Fantasy and Science Fiction: 22nd Series	4083	ANTHOLOGY	2

Well, these ones are easy - ISBN-10: 9997374851 - ISBN-10: 9997376463 - just zap them. There are GOOD records for those titles around still, we can do without this rubbish. (Yes, I did check if there was more information about them under those ISBNs, but only Amazon admits they exist and Amazon is WRONG.)

Beyond the Galactic Rim / The Ship From Outside	4374	COLLECTION	3

That looks familiar... oh yes, already fixed under the Ace Double initiative. Ignore. Here's an interesting pair though:

Blood & Ivory: A Tapestry	4835	COLLECTION	2	
Blood and Ivory: A Tapestry	4836	COLLECTION	2

Let's change the "&" version to NOVEL and see what's up. (The other doesn't have enough content to be interesting.)

Blood and Ivory: A Tapestry (expanded) • collection by P. C. Hodgell  (aka Blood & Ivory: The Book of Jame) [as by P. C. Hodgell ]
Blood & Ivory: A Tapestry • collection by P. C. Hodgell

Well, somebody's tried some serious merges on titles there, along with a variant - I have NO clue what's correct. I AM going to leave the pub as a NOVEL, although remove the pair of collections, in the hope that someone can explain them all at some point. I'll put a NOVEL title into the pub as well just so that it shows up, and notes too. Am I breaking my rule of:

DO NOT TRY THIS IF YOU DO NOT KNOW HOW TO PUT THINGS BACK TO AT LEAST NO MORE THAN THE SAME SORT OF PROBLEM AS THEY HAD BEFORE!

I don't think so. I think it's LESS of a problem if the problem is still visible.

Here's an awkward one: it looks like two titles if you search that way, only one pub if you search that way: and it's VERIFIED.

Budayeen Nights	5665	COLLECTION	2

OK, change to NOVEL: yes, there's two collections in there. Let's remove one - damn, it removed BOTH. The "two titles, one pub" should have given that away. OK, add one COLLECTION back and change the pub type back - sorted. No need to ask the absent verifier what I did wrong... but I really shouldn't be trying any more this late at night under the influence of alcohol. And that's only record 9 of 67... :-/

I really wouldn't recommend this sort of fixing if you can't approve your own edits, it's difficult to explain WHY you're messing up things (in the short term) to a separate approver. But if you've got to that stage, then you might want to try this on some obscure titles that haven't been fully completed, and not verified. There's (thankfully!) not too many pubs in this state, but it's easy to create more (I have no idea if the problem is better or worse since the last backup, I can't think of a way to check the current ISFDB situation) so we have to be vigilant for such. I don't want to be the only fixer though. If you feel you can help, drop me a line on my talk page. BLongley 17:48, 1 Nov 2007 (CDT)

Note: as of tonight I've worked my way through all the current problems that this SQL example points out. I've left a few with the individual Verifiers to finish off. Thankfully, it seems a few had already been fixed, and NOT just by deleting the problem (which tends to delete good data along with the bad). But I'm sure there'll be some more visible after the next backup. In the meantime, if you spot apparent duplicate pubs, DO please use the "Diff Publications" option and check whether ISFDB is comparing a pub with itself - deleting either of the apparent duplicate titles will delete the ONE pub with all the useful information in, and you don't want to have to reenter it all, do you? BLongley 17:10, 4 Nov 2007 (CST)

Double Pseudonyms

It's OK for an author to have many pseudonyms, but sometimes they get the same pseudonym twice:

select p.author_id, p.pseudonym, a2.author_canonical, a1.author_canonical, count(*) 
from pseudonyms p, authors a1, authors a2
where p.pseudonym = a1.author_id
and p.author_id = a2.author_id
group by p.author_id, p.pseudonym, a2.author_canonical, a1.author_canonical
having count(*) > 1
order by 3

author_id	pseudonym	author_canonical	author_canonical1	count(*)	
131	        21289	        Brian W. Aldiss	        Brian Aldiss	        2	
4921	        501	        Ellery Queen	        Avram Davidson	        2	
1645	        7807	        Geo W. Proctor	        George W. Proctor	2	
55321	        68411	        Jim Steranko	        Steranko	        2	
1170	        11214	        John Jakes	        John W. Jakes	        2	
60471	        83981	        Joseph Dreany	        J. Dreany	        2	
2902	        21809	        Michael A. Arnzen	Mike Arnzen	        2	
517	        40131	        Miriam Allen deFord	Miriam Allen de Ford	2	
456	        11213	        Nelson S. Bond	        Nelson Bond	        2	
18315	        81517	        Richard Harris Barham	R. H. Barham	        2	
29	        70121	        Robert A. Heinlein	Robert Heinlein	        2	
10527	        196	        Robin Scott Wilson	Robin Wilson	        2	
22	        24197	        Samuel R. Delany	K. Leslie Steiner	2	
10574	        88989	        Stephen P. Brown	S. Patrick Brown	3	
2188	        32048	        Vicente Segrelles	Segrelles	        2	
1803	        43381	        Vincent Di Fate	        Vincent DiFate          2

If anyone knows any method of cleaning these up that doesn't require SQL or massive numbers of edits, please let me know. BLongley 14:26, 2 Nov 2007 (CDT)

And I see there's three more now - please be careful, people! BLongley 15:38, 8 Nov 2007 (CST)

Entropy Measurements

From an interesting discussion here, let's see what we can do to measure our (hopefully continual) improvements. Taking the first suggestion: "For magazines, titles entered without page numbers", let's see what we can do. Here's a start:

select case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end, count(*)
from pub_content pc, pubs p, titles t
where p.pub_id = pc.pub_id
and p.pub_ctype = 'MAGAZINE'
and t.title_id = pc.title_id
and t.title_ttype NOT IN ('EDITOR','COVERART')
group by case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end

Bad	53750	
Good	48353

Not even halfway yet, but this is not from the latest backup. Note that EDITOR and COVERART don't need page numbers, so shouldn't be included. And this doesn't actually include MAGAZINES with NO contents:

select count(*)
from pubs p where not exists (select 1 from pub_content pc where p.pub_id = pc.pub_id)
and p.pub_ctype = 'MAGAZINE'

But there are only 5 of those. That's about as many problems as we have with NOVEL, NONFICTION, and COLLECTION(!) entries in magazines. BLongley 14:58, 3 Dec 2007 (CST)
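The two magazine checks above can be tried out on toy data with a sketch like this one, using Python's sqlite3 module. The table layouts are cut down to the columns the queries use, the 'SHORTFICTION' type and all rows are made-up test values, and "pubc_page IS NULL" stands in for the ifnull(..., -999) comparison, which should classify rows the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_ctype TEXT);
    CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_ttype TEXT);
    CREATE TABLE pub_content (pubc_id INTEGER PRIMARY KEY, pub_id INTEGER,
                              title_id INTEGER, pubc_page TEXT);
    -- Toy data: magazine 1 has contents, magazine 2 has none, pub 3 is a novel.
    INSERT INTO pubs VALUES (1, 'MAGAZINE'), (2, 'MAGAZINE'), (3, 'NOVEL');
    INSERT INTO titles VALUES (10, 'SHORTFICTION'), (11, 'SHORTFICTION'),
                              (12, 'EDITOR'), (13, 'NOVEL');
    INSERT INTO pub_content (pub_id, title_id, pubc_page)
    VALUES (1, 10, '5'), (1, 11, NULL), (1, 12, NULL), (3, 13, NULL);
""")

# Magazine contents with and without a page number, ignoring EDITOR and COVERART.
page_counts = conn.execute("""
    SELECT CASE WHEN pc.pubc_page IS NULL THEN 'Bad' ELSE 'Good' END AS status,
           count(*)
    FROM pub_content pc, pubs p, titles t
    WHERE p.pub_id = pc.pub_id
      AND p.pub_ctype = 'MAGAZINE'
      AND t.title_id = pc.title_id
      AND t.title_ttype NOT IN ('EDITOR', 'COVERART')
    GROUP BY status
    ORDER BY status
""").fetchall()

# Magazines with no pub_content rows at all, which the first query cannot see.
empty_magazines = conn.execute("""
    SELECT count(*)
    FROM pubs p
    WHERE NOT EXISTS (SELECT 1 FROM pub_content pc WHERE p.pub_id = pc.pub_id)
      AND p.pub_ctype = 'MAGAZINE'
""").fetchone()[0]

print(page_counts, empty_magazines)
```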

A couple of other quick Entropy Measurements, useful to watch over time maybe.

Pubs without pages, that aren't audio-books (which shouldn't have them) or electronic (which may or may not):

select case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end, count(*)
from pubs p
WHERE p.pub_ptype NOT LIKE '%audio%'
And p.pub_ptype NOT LIKE '%cassette%'
And p.pub_ptype NOT LIKE 'CD%'
And p.pub_ptype NOT LIKE 'compact disc%'
And p.pub_ptype NOT LIKE 'e%book%'
And p.pub_ptype NOT LIKE 'electron%'
And p.pub_ptype NOT LIKE '%web%'
And p.pub_ptype NOT LIKE '%ezine%'
And p.pub_ptype NOT LIKE '%internet%'
And p.pub_ptype NOT LIKE '%mp3%'
And p.pub_ptype NOT LIKE '%Adobe%'
And p.pub_ptype NOT LIKE '%Mobipocket%'
And p.pub_ptype NOT LIKE '%PDF%'
And p.pub_ptype NOT LIKE '%tape%'
And p.pub_ptype NOT LIKE '%www%'
And p.pub_ptype NOT LIKE '%digit%'
And p.pub_ptype NOT LIKE '%online%'
group by case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end

Bad	19535
Good	82131

Pubs without Prices: I've been using "L" on its own to force a link to Amazon UK at times, but if it's ONLY "L" then it still hasn't got a proper price.

select case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end, count(*)
from pubs p
group by case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end

Bad	13118
Good	96196

OK, off to download a later backup and see if things have got better or worse... BLongley 13:22, 9 Dec 2007 (CST)
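A small sketch of how the price check treats the lone "L", again with Python's sqlite3 module and toy rows. The price strings are invented for illustration and only the "L"-on-its-own convention comes from the comment above: both a missing price and a bare "L" collapse to 'L' and count as Bad, while anything longer counts as Good.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_price TEXT);
    -- Toy data: no price, a bare 'L', and two made-up real prices.
    INSERT INTO pubs VALUES (1, NULL), (2, 'L'), (3, 'L5.99'), (4, '$4.99');
""")

# ifnull maps NULL onto the same 'L' marker, so one comparison catches both cases.
price_counts = conn.execute("""
    SELECT CASE ifnull(pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END AS status,
           count(*)
    FROM pubs
    GROUP BY status
    ORDER BY status
""").fetchall()

print(price_counts)
```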

Reviews of things that don't exist

Due to "Entropy Measurements" again, I had a look at how we represent reviews. How do we get a review to refer to the reviewed title and the reviewed author, and still fit a reviewer in? Well, this is one of the complicated bits. We sneak TWO entries into "canonical_author". Possibly the most misleading table name in the database: it links titles and authors, but different kinds of authors, so occasionally we need multiple entries per title. It doesn't just affect reviews either; it covers interviews too. When looking at "canonical_author" there's a "ca_status" column that helps sort them out. I think it goes like this (for Review and Interview title types):

 ca_status = 1 -- Reviewer/Interviewer
 ca_status = 2 -- Interviewee
 ca_status = 3 -- Author of reviewed title

So we can enter a pair of entries, for reviewer and reviewee, or interviewer and interviewee, and it all works somehow. (No idea what happens if all three are present.)

So I guess we can do something like this for reviews of missing things:

select t.title_title, a.author_canonical 
from titles t, canonical_author ca, authors a
where ca.title_id = t.title_id
and a.author_id = ca.author_id
and   t.title_ttype = 'REVIEW'
and   ca.ca_status = 3
and not exists (select 1 from pubs p
                where p.pub_title = t.title_title)
LIMIT 20

Notice the LIMIT, as these queries are dead slow if you get the "NOT EXISTS" bit wrong. I think the ISFDB software is cleverer for matching things that DO exist, and matches title and author against titles and their authors rather than against pubs. But when something is missing, it might be good to check against everywhere they MIGHT be. Still, as this can only be run against downloads of the database, maybe it's time to look at how we can locally add indexes and suchlike performance improvements for the queries that check BAD things rather than just GOOD things.

(Database experts, feel free to butt in here. I'd rather get a definitive answer on the design over figuring it out myself from the data.) BLongley 16:13, 7 Dec 2007 (CST)
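On the idea of adding indexes to a local copy: here is a rough sketch of the "reviews of missing things" query with an index on the column the NOT EXISTS probes. It uses Python's sqlite3 module and toy data rather than a real MySQL backup. The index name "idx_pubs_title", the trimmed table layouts, and all rows are assumptions for illustration; on MySQL a long text column may need a prefix length in the index definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_title TEXT, title_ttype TEXT);
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, author_canonical TEXT);
    CREATE TABLE canonical_author (title_id INTEGER, author_id INTEGER, ca_status INTEGER);
    CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_title TEXT);
    -- Toy data: one review of a book we have, one review of a book we don't.
    INSERT INTO titles VALUES (100, 'Deathbeast', 'REVIEW'),
                              (101, 'Nonexistent Book', 'REVIEW');
    INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B'), (3, 'Some Reviewer');
    -- ca_status 3 = author of reviewed title, 1 = reviewer (as guessed above).
    INSERT INTO canonical_author VALUES (100, 1, 3), (100, 3, 1),
                                        (101, 2, 3), (101, 3, 1);
    INSERT INTO pubs VALUES (1, 'Deathbeast');
    -- Hypothetical local-only index so each NOT EXISTS probe is a lookup, not a scan.
    CREATE INDEX idx_pubs_title ON pubs (pub_title);
""")

query = """
    SELECT t.title_title, a.author_canonical
    FROM titles t, canonical_author ca, authors a
    WHERE ca.title_id = t.title_id
      AND a.author_id = ca.author_id
      AND t.title_ttype = 'REVIEW'
      AND ca.ca_status = 3
      AND NOT EXISTS (SELECT 1 FROM pubs p
                      WHERE p.pub_title = t.title_title)
    ORDER BY t.title_title
    LIMIT 20
"""
missing = conn.execute(query).fetchall()

# The plan text differs between SQLite versions, so it is printed, not relied on.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(missing)
print(plan)
```

Printing the plan is a cheap way to see whether the subquery searches the index or scans the whole table; on MySQL, EXPLAIN gives the equivalent information.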

Playing with TOAD 3.1

Well, it still kills my machine if I run an inefficient query, but the Graphical Query Builder is looking quite useful (rather like the one in MS-Access). As usual though, the sort of SQL it generates isn't what I'd necessarily write:

 SELECT `titles`.title_ttype, `authors`.author_canonical, `titles`.title_title,
      MAX(`verification`.user_id)
         , MAX(`verification`.reference_id)
 FROM    (   (   (   (   isfdb.titles `titles`
                      INNER JOIN
                         isfdb.canonical_author `canonical_author`
                      ON (`titles`.title_id = `canonical_author`.title_id))
                  INNER JOIN
                     isfdb.authors `authors`
                  ON (`canonical_author`.author_id = `authors`.author_id))
              INNER JOIN
                 isfdb.pub_content `pub_content`
              ON (`pub_content`.title_id = `titles`.title_id))
          INNER JOIN
             isfdb.pubs `pubs`
          ON (`pub_content`.pub_id = `pubs`.pub_id))
      LEFT OUTER JOIN
         isfdb.verification `verification`
      ON (`verification`.pub_id = `pubs`.pub_id)
 WHERE (`authors`.author_canonical = 'Philip K. Dick')
       AND (`titles`.title_ttype IN ('NOVEL', 'ANTHOLOGY', 'COLLECTION'))
       AND (`verification`.reference_id = 1
            OR `verification`.reference_id IS NULL)
       AND (`verification`.user_id IN (2781, 4121)
            OR `verification`.user_id IS NULL)
 GROUP BY `titles`.title_ttype,
          `authors`.author_canonical,
          `titles`.title_title

Still, that's a perfectly useful query for finding books by an author and showing whether I've verified them or not. Now if only I'd verified all the pubs I own before other people got to them... :-/ BLongley 12:55, 9 Dec 2007 (CST)
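One thing worth knowing about the generated query: with the verification filters in the WHERE clause, a pub verified only by somebody else matches neither the user list nor the IS NULL test, so it drops out of the result altogether. A variant, sketched below with Python's sqlite3 module on toy data, moves the filters into the ON clause so such pubs still appear, just without a verifier. The tables are trimmed to the relevant columns and user 9999 is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY);
    CREATE TABLE verification (pub_id INTEGER, reference_id INTEGER, user_id INTEGER);
    INSERT INTO pubs VALUES (1), (2), (3);
    -- Pub 1 verified by one of "my" users, pub 2 by someone else, pub 3 by nobody.
    INSERT INTO verification VALUES (1, 1, 2781), (2, 1, 9999);
""")

# Filters in WHERE, as in the generated query: pub 2 vanishes from the result.
filtered_in_where = conn.execute("""
    SELECT p.pub_id, MAX(v.user_id)
    FROM pubs p
    LEFT OUTER JOIN verification v ON v.pub_id = p.pub_id
    WHERE (v.reference_id = 1 OR v.reference_id IS NULL)
      AND (v.user_id IN (2781, 4121) OR v.user_id IS NULL)
    GROUP BY p.pub_id
    ORDER BY p.pub_id
""").fetchall()

# Filters in ON: every pub is kept, with NULL where none of my verifications match.
filtered_in_on = conn.execute("""
    SELECT p.pub_id, MAX(v.user_id)
    FROM pubs p
    LEFT OUTER JOIN verification v
           ON v.pub_id = p.pub_id
          AND v.reference_id = 1
          AND v.user_id IN (2781, 4121)
    GROUP BY p.pub_id
    ORDER BY p.pub_id
""").fetchall()

print(filtered_in_where)
print(filtered_in_on)
```

Which behaviour is wanted depends on the question being asked; for "have I verified this or not" the ON version seems closer.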

Authors that only exist because of reviews

They are a pain as they show up in the Author Directory with no useful information. This script should find the stray authors, along with the title IDs for the reviews that create them. (Bung the title ID into a "http://www.isfdb.org/cgi-bin/title.cgi?XXXXXX" query directly.)

select a.author_canonical, ca.title_id
from  canonical_author ca, authors a
where ca.ca_status = 3
and   ca.author_id = a.author_id
and not exists (select 1 from canonical_author ca2, titles t
                where ca.author_id = ca2.author_id
                and   ca2.title_id = t.title_id
                and   t.title_ttype != 'REVIEW'
                and   ca2.ca_status = 1)
order by 1,2

According to my last loaded backup, there are about 1200 authors that may need the review correcting, or a reviewed title to be created. I'll update when I've loaded the latest. BLongley 09:06, 29 Dec 2007 (CST)

Down to 994 now - must have been a good month of cleanups! BLongley 09:23, 29 Dec 2007 (CST)
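For checking the stray-author script against something small, here is a sketch with Python's sqlite3 module and invented rows. The author names, IDs and trimmed table layouts are all assumptions; the ca_status meanings follow the guess made earlier on this page. It also builds the title.cgi links mentioned above from the title IDs it finds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, author_canonical TEXT);
    CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_ttype TEXT);
    CREATE TABLE canonical_author (title_id INTEGER, author_id INTEGER, ca_status INTEGER);
    INSERT INTO authors VALUES (1, 'Reviewed Only'), (2, 'Real Author'), (3, 'Reviewer');
    INSERT INTO titles VALUES (10, 'REVIEW'), (11, 'REVIEW'), (20, 'NOVEL');
    -- Author 1 appears only as a reviewed author; author 2 also wrote title 20.
    INSERT INTO canonical_author VALUES (10, 1, 3), (10, 3, 1),
                                        (11, 2, 3), (11, 3, 1),
                                        (20, 2, 1);
""")

# Reviewed authors with no non-review title of their own.
strays = conn.execute("""
    SELECT a.author_canonical, ca.title_id
    FROM canonical_author ca, authors a
    WHERE ca.ca_status = 3
      AND ca.author_id = a.author_id
      AND NOT EXISTS (SELECT 1 FROM canonical_author ca2, titles t
                      WHERE ca.author_id = ca2.author_id
                        AND ca2.title_id = t.title_id
                        AND t.title_ttype != 'REVIEW'
                        AND ca2.ca_status = 1)
    ORDER BY 1, 2
""").fetchall()

# Turn each review's title ID into the link format quoted above.
links = ["http://www.isfdb.org/cgi-bin/title.cgi?%d" % title_id
         for _, title_id in strays]

print(strays)
print(links)
```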