ISFDB talk:Data Entropy
Generally positive news, but the lack of updates on the price category makes me wonder if that's of use. Some good Publisher regularization recently makes me think we need another measure, but that can't be as absolute as the other measures - we don't want "one publisher to rule them all", we just want to reach the right level of accurately recording imprint and publishers without over-duplication. Other suggestions for (definable) measures are welcome though. BLongley 21:31, 10 July 2008 (UTC)
- Yep! rbh (Bob) 01:37, 11 July 2008 (UTC)
- Older (<1930) publications are often difficult to find price data for, so there is probably a floor for that category, although I doubt that we have reached it yet. There is still a lot of dirty data with mangled ISBNs, prices, page counts, etc. that we need to clean up. And no, this page hasn't been forgotten, at least not by me, but there are so many other things going on and so little time to work on all of them... Ahasuerus 02:50, 11 July 2008 (UTC)
- I, for one, do look at this. There are other measures that I would in theory like to see, but I'm not sure if they are worth the trouble of running and posting them. Still, if someone feels like it:
- Prices are much harder to find for older works, and for many older works there never was a cover price in the modern sense. Therefore, it would be nice to see a version of the price query with a date limit, or even perhaps multiple versions with different date limits. For example
- Percent of publications with prices published from 1900 to the present
- Percent of publications with prices published from 1920 to the present
- Percent of publications with prices published from 1940 to the present
- Verification is also an important measure of data entropy, IMO, and so I would like to see:
- Percent of publications verified
- Percent of titles with at least one verified publication. While often when entering or verifying a title people add other pubs that can be inferred from data on the copyright page, and/or data from secondary sources, if at least one pub is verified, our confidence in the whole can be somewhat higher. If we had a well defined concept of "edition" then I would say "editions with at least one verified pub"
- Again, I am not asking for any of these, but I think they might possibly be of some limited value.
- On publishers, we can't really devise queries that measure how far we are from normalization until we have better defined what we mean by normal, IMO. The only query I can think of that would offer some limited value (besides the raw list of publisher names now in the system, which is too long for a wiki list) would be a bracketed list of number of publications/publisher. That is: the number of publishers with exactly one pub; number with 2-10 pubs; number with 11-20 pubs; ... number with 50-100 pubs; etc. The precise brackets are not IMO important, except that they should probably get larger as the count rises, and that "exactly one" should almost surely be a separate bracket. Does anyone else think such data would be helpful?
- Anyway, those are my views. -DES Talk 15:27, 11 July 2008 (UTC)
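Two of the measures suggested above, price coverage with a date cutoff and the bracketed publisher counts, could be sketched roughly as follows. This is only an illustration against a toy in-memory table: the column names (`pub_year` as a plain integer, `price`, `publisher`), the publisher names, and the bracket edges are all assumptions made for the example, not the actual ISFDB schema or data.

```python
import sqlite3

# Toy data only: (year, price, publisher). The publisher names, the column
# names and the integer year are assumptions made for this sketch.
SAMPLE = (
    [(1895, None, "Alpha"), (1895, None, "Delta")]      # one pub each, no price
    + [(1925, None, "Beta")] * 2 + [(1925, "$0.25", "Beta")]
    + [(1950, "$0.35", "Gamma")] * 9 + [(1950, "", "Gamma")] * 3
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_year INTEGER,"
    " price TEXT, publisher TEXT)"
)
conn.executemany(
    "INSERT INTO pubs (pub_year, price, publisher) VALUES (?, ?, ?)", SAMPLE
)


def price_coverage(conn, since_year):
    """Percent of pubs dated since_year or later that have a non-empty price."""
    total, priced = conn.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN price IS NOT NULL AND price <> '' THEN 1 ELSE 0 END)
          FROM pubs
         WHERE pub_year >= ?
        """,
        (since_year,),
    ).fetchone()
    # No pubs in range: report None rather than dividing by zero.
    return 100.0 * priced / total if total else None


def publisher_brackets(conn):
    """Number of publishers per pub-count bracket, smallest bracket first."""
    return conn.execute(
        """
        SELECT CASE
                 WHEN n = 1    THEN '1'
                 WHEN n <= 10  THEN '2-10'
                 WHEN n <= 20  THEN '11-20'
                 WHEN n <= 50  THEN '21-50'
                 WHEN n <= 100 THEN '51-100'
                 ELSE '101+'
               END AS bracket,
               COUNT(*) AS publishers
          FROM (SELECT publisher, COUNT(*) AS n
                  FROM pubs
                 GROUP BY publisher) AS per_publisher
         GROUP BY bracket
         ORDER BY MIN(n)
        """
    ).fetchall()


for cutoff in (1900, 1920, 1940):
    print(cutoff, price_coverage(conn, cutoff))
print(publisher_brackets(conn))
```

"Exactly one" gets its own bracket, as suggested, since single-pub publisher names are the likeliest candidates for typos and unmerged variants. The other bracket edges are arbitrary and easy to change in the `CASE` expression.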
- We can check the current number/percentage of Verified Publications on the Statistics page -- currently 14.01% -- but it doesn't have historical data or secondary verifications information. I agree that "percent of titles with at least one verified publication" sounds like a useful metric and something to add to our to-do list. Ahasuerus 16:45, 11 July 2008 (UTC)
- Primary Verified (including Transient), and Verified (any way) are easy to add: e.g.
select 'Primary Verified', count(*)
  from pubs p, verification v
 where v.pub_id = p.pub_id
   and (v.reference_id = 1 or v.reference_id = 12)
UNION
select 'Verified', count(*)
  from pubs p, verification v
 where v.pub_id = p.pub_id
UNION
select 'Total Pubs', count(*)
  from pubs p
- Gives us more encouraging percentages like:
Primary Verified    18609   15.0%
Verified            27420   22.1%
Total Pubs         123993
BLongley 18:59, 11 July 2008 (UTC)
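One thing worth checking in the query above: joining pubs to verification and using count(*) counts verification rows, so a pub with more than one verification row (say, Primary plus Transient, or Primary plus a secondary source) would be counted once per row. Whether that inflates the real figures depends on how the verification table is populated, which this sketch does not know. A possible variant using COUNT(DISTINCT ...) is shown below on toy tables; reference ids 1 and 12 are taken from the query above, while reference id 5 is a made-up stand-in for some secondary source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY);
CREATE TABLE verification (pub_id INTEGER, reference_id INTEGER);
INSERT INTO pubs (pub_id) VALUES (1), (2), (3), (4), (5);
-- pub 1: primary (1) and transient (12); pub 2: primary plus secondary (5,
-- hypothetical id); pub 3: secondary only; pubs 4 and 5: unverified.
INSERT INTO verification (pub_id, reference_id)
VALUES (1, 1), (1, 12), (2, 1), (2, 5), (3, 5);
""")

# Row count as in the join above: one per verification row, not per pub.
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM pubs p, verification v WHERE v.pub_id = p.pub_id"
).fetchone()[0]

# Distinct-pub variant: each pub is counted at most once per measure.
COUNTS_SQL = """
SELECT
  (SELECT COUNT(DISTINCT v.pub_id)
     FROM pubs p JOIN verification v ON v.pub_id = p.pub_id
    WHERE v.reference_id IN (1, 12))                         AS primary_verified,
  (SELECT COUNT(DISTINCT v.pub_id)
     FROM pubs p JOIN verification v ON v.pub_id = p.pub_id) AS verified,
  (SELECT COUNT(*) FROM pubs)                                AS total_pubs
"""
primary_verified, verified, total_pubs = conn.execute(COUNTS_SQL).fetchone()

print("joined rows:", joined_rows)
print("Primary Verified", primary_verified, f"{100.0 * primary_verified / total_pubs:.1f}%")
print("Verified", verified, f"{100.0 * verified / total_pubs:.1f}%")
print("Total Pubs", total_pubs)
```

On the toy data the join yields five rows for only three verified pubs, which is the gap the DISTINCT version closes.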
Titles with Verified Pub
This is a bigger can of worms, so new section.
- How to count titles with no pubs at all? Some are strays left behind after a pub-delete. Others are placeholders for pubs we've not yet found.
- Variants. It's not easy to separate titles that only exist to go under the canonical author display, from those that really did get published under different titles or different pen-names.
- Magazines. These tend to only have one publication[1] so it's a Yes/No category rather than "we know one pub exists, how many other printings do?" question. I think these could usefully be separated out, e.g. I know I've entered a load of British stubs that must go in the unverified category, whereas a lot of current work is from primary copies. I'm not sure I want to publish the fact that I'm destroying the magazine editors' statistical reputation though... ;-)
- Chapterbooks. Current ISFDB treatment of shortfiction published stand-alone sucks, IMO, and these are a mess I'm not going to attempt to deal with.
- "Complete Novel" serials - is this proof of a NOVEL's existence? Or do we demand a physical book? (Cross-over with Chapterbook above.)
- Nonfiction. Much has been added just to link REVIEWs to. We don't particularly separate useful (in ISFDB terms) Nonfiction books like Bibliographies and Biographies from plain Science Books. Or other works by same author that may be here for completeness alone.
- Miscounts in Omnibuses. NOVELs in Omnibuses are fairly easy, but collections or anthologies in such may list only the shortfiction and not the COLLECTION or ANTHOLOGY records.
[1] ISFDB doesn't let you clone a Magazine so foreign editions are very under-represented until they deviate horribly and need separate titles.
Still, with all the disclaimers above to be noted, here's a vague stab at some possibly useful stats:
select 'Titles with Verified Pub', t.title_ttype, count(DISTINCT t.title_id)
  from titles t, pub_content pc, pubs p, verification v
 where v.pub_id = p.pub_id
   and t.title_id = pc.title_id
   and pc.pub_id = p.pub_id
   and p.pub_ctype = t.title_ttype
   and p.pub_ctype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
   and t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
 GROUP BY t.title_ttype
UNION
select 'Titles with Pub', t.title_ttype, count(DISTINCT t.title_id)
  from titles t, pub_content pc, pubs p
 where t.title_id = pc.title_id
   and pc.pub_id = p.pub_id
   and p.pub_ctype = t.title_ttype
   and p.pub_ctype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
   and t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
 GROUP BY t.title_ttype
 ORDER BY 2,1 desc
That gives us percentages like:
Titles with Verified Pub   ANTHOLOGY     1332   27.3%
Titles with Pub            ANTHOLOGY     4874
Titles with Verified Pub   COLLECTION    1504   30.0%
Titles with Pub            COLLECTION    5007
Titles with Verified Pub   NONFICTION     251    6.9%
Titles with Pub            NONFICTION    3631
Titles with Verified Pub   NOVEL         7497   16.5%
Titles with Pub            NOVEL        45569
Titles with Verified Pub   OMNIBUS        429   21.5%
Titles with Pub            OMNIBUS       1993
Does anyone find any of these figures useful? BLongley 20:25, 11 July 2008 (UTC)
- And please, if anyone else uses plain SQL for offline ISFDB queries, check my SQL and understanding of pub contents and title types etc - point out any glaring errors or at least question them. I've no idea if I use it the way Al intends. I've added my earlier wanderings here if anyone finds those useful and can think of a better place for such. BLongley 20:49, 11 July 2008 (UTC)
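For anyone wanting to cross-check, the two halves of the UNION above can also be folded into a single pass that returns both counts per title type, so the percentage falls out directly. The sketch below runs on toy tables that reuse only the table and column names appearing in the query above; the rows, ids, and the secondary reference id 5 are invented for the example. It keeps the same rule that a pub only counts for a title when the pub type matches the title type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_ttype TEXT);
CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_ctype TEXT);
CREATE TABLE pub_content (pub_id INTEGER, title_id INTEGER);
CREATE TABLE verification (pub_id INTEGER, reference_id INTEGER);

-- Title 3 appears only inside an OMNIBUS pub; title 5 has no pubs at all.
INSERT INTO titles (title_id, title_ttype)
VALUES (1, 'NOVEL'), (2, 'NOVEL'), (3, 'NOVEL'), (4, 'COLLECTION'), (5, 'NOVEL');
INSERT INTO pubs (pub_id, pub_ctype)
VALUES (10, 'NOVEL'), (11, 'NOVEL'), (12, 'NOVEL'), (13, 'COLLECTION'), (14, 'OMNIBUS');
INSERT INTO pub_content (pub_id, title_id)
VALUES (10, 1), (11, 1), (12, 2), (13, 4), (14, 3);
-- Pub 10 is verified twice; pub 13 once (reference id 5 is hypothetical).
INSERT INTO verification (pub_id, reference_id)
VALUES (10, 1), (10, 12), (13, 5);
""")

PER_TYPE_SQL = """
SELECT t.title_ttype,
       COUNT(DISTINCT t.title_id) AS with_pub,
       COUNT(DISTINCT CASE WHEN v.pub_id IS NOT NULL
                           THEN t.title_id END) AS with_verified_pub
  FROM titles t
  JOIN pub_content pc ON pc.title_id = t.title_id
  JOIN pubs p ON p.pub_id = pc.pub_id AND p.pub_ctype = t.title_ttype
  LEFT JOIN verification v ON v.pub_id = p.pub_id
 WHERE t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
 GROUP BY t.title_ttype
 ORDER BY t.title_ttype
"""

# Each result row: (type, titles with a pub, titles with a verified pub, percent).
report = [
    (ttype, with_pub, verified, round(100.0 * verified / with_pub, 1))
    for ttype, with_pub, verified in conn.execute(PER_TYPE_SQL)
]
for line in report:
    print(*line)
```

The LEFT JOIN keeps unverified pubs in the "with pub" count, while the CASE inside COUNT(DISTINCT ...) counts a title as verified only when at least one of its matching pubs has a verification row. Titles with no pubs, and NOVELs that appear only inside an OMNIBUS, drop out of both counts, just as they do in the query above.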