ISFDB talk:Data Entropy

Additional measures

Generally positive news, but the lack of updates on the price category makes me wonder if that's of use. Some good Publisher regularization recently makes me think we need another measure, but that can't be as absolute as the other measures - we don't want "one publisher to rule them all", we just want to reach the right level of accurately recording imprints and publishers without over-duplication. Other suggestions for (definable) measures are welcome though. BLongley 21:31, 10 July 2008 (UTC)

Yep! rbh (Bob) 01:37, 11 July 2008 (UTC)
Older (<1930) publications are often difficult to find price data for, so there is probably a floor for that category, although I doubt that we have reached it yet. There is still a lot of dirty data with mangled ISBNs, prices, page counts, etc. that we need to clean up. And no, this page hasn't been forgotten, at least not by me, but there are so many other things going on and so little time to work on all of them... Ahasuerus 02:50, 11 July 2008 (UTC)
I, for one, do look at this. There are other measures that I would in theory like to see, but I'm not sure if they are worth the trouble of running and posting them. Still, if someone feels like it:
  • Prices are much harder to find for older works, and for many older works there never was a cover price in the modern sense. Therefore, it would be nice to see a version of the price query with a date limit, or even perhaps multiple versions with different date limits. For example:
    • Percent of publications with prices published from 1900 to the present
    • Percent of publications with prices published from 1920 to the present
    • Percent of publications with prices published from 1940 to the present
  • Verification is also an important measure of data entropy, IMO, and so I would like to see
    • Percent of publications verified
    • Percent of titles with at least one verified publication. While often when entering or verifying a title people add other pubs that can be inferred from data on the copyright page, and/or data from secondary sources, if at least one pub is verified, our confidence in the whole can be somewhat higher. If we had a well defined concept of "edition" then I would say "editions with at least one verified pub"
Again, I am not asking for any of these, but I think they might possibly be of some limited value.
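For anyone who wants to try the date-limited price measure, here is a rough sketch. It runs against a toy table in Python's sqlite3, not the real database, and the column names pub_year and pub_price are assumptions, as is the idea that a missing price shows up as NULL or an empty string.

```python
import sqlite3

# Toy stand-in for the pubs table; pub_year and pub_price are assumed column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_year INTEGER, pub_price TEXT)"
)
conn.executemany(
    "INSERT INTO pubs (pub_id, pub_year, pub_price) VALUES (?, ?, ?)",
    [
        (1, 1895, None),     # made-up sample rows, not real ISFDB data
        (2, 1910, None),
        (3, 1925, "$0.25"),
        (4, 1950, "$0.35"),
        (5, 1975, ""),
        (6, 2005, "$7.99"),
    ],
)


def percent_with_price(conn, from_year):
    """Percent of pubs dated from_year or later that have a non-empty price."""
    total, priced = conn.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN pub_price IS NOT NULL AND pub_price <> '' THEN 1 ELSE 0 END)
        FROM pubs
        WHERE pub_year >= ?
        """,
        (from_year,),
    ).fetchone()
    # Guard against an empty date range, where SUM() returns NULL.
    return 100.0 * priced / total if total else 0.0


# One figure per date limit, as in the list above.
for cutoff in (1900, 1920, 1940):
    print(cutoff, round(percent_with_price(conn, cutoff), 1))
```

Passing the cutoff as a parameter means the same query serves all three date limits, and adding another limit is a one-line change.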
On publishers, we can't really devise queries that measure how far we are from normalization until we have better defined what we mean by normal, IMO. The only query I can think of that would offer some limited value (besides the raw list of publisher names now in the system, which is too long for a wiki list) would be a bracketed list of number of publications/publisher. That is: the number of publishers with exactly one pub; number with 2-10 pubs; number with 11-20 pubs; ... number with 50-100 pubs; etc. The precise brackets are not IMO important, except that they should probably get larger as the count rises, and that "exactly one" should almost surely be a separate bracket. Does anyone else think such data would be helpful?
Anyway, those are my views. -DES Talk 15:27, 11 July 2008 (UTC)
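The bracketed publisher count described above could be sketched roughly as follows. This is an illustration only: it uses a toy table in Python's sqlite3, the publisher_id column on pubs is an assumption, and the bracket boundaries are arbitrary picks that follow the "exactly one, then widening ranges" idea.

```python
import sqlite3
from collections import Counter

# Toy pubs table; publisher_id is an assumed column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, publisher_id INTEGER)")
# Made-up data: publishers 1 and 4 have one pub each, 2 has three, 3 has twelve.
rows = [(1,)] + [(2,)] * 3 + [(3,)] * 12 + [(4,)]
conn.executemany("INSERT INTO pubs (publisher_id) VALUES (?)", rows)

# Illustrative brackets as (low, high); high=None means open-ended.
BRACKETS = [(1, 1), (2, 10), (11, 20), (21, 50), (51, 100), (101, None)]


def bracket_label(pub_count):
    """Return the label of the bracket that pub_count (>= 1) falls into."""
    for low, high in BRACKETS:
        if high is None:
            return f"{low}+"
        if pub_count <= high:
            return str(low) if low == high else f"{low}-{high}"


def publishers_per_bracket(conn):
    """Count how many publishers fall into each pubs-per-publisher bracket."""
    per_publisher = conn.execute(
        "SELECT publisher_id, COUNT(*) FROM pubs GROUP BY publisher_id"
    ).fetchall()
    return Counter(bracket_label(n) for _, n in per_publisher)


histogram = publishers_per_bracket(conn)
for low, high in BRACKETS:
    label = bracket_label(low)
    print(label, histogram[label])
```

Keeping the brackets in a list makes it easy to widen or split them later without touching the query.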
We can check the current number/percentage of Verified Publications on the Statistics page -- currently 14.01% -- but it doesn't have historical data or secondary verifications information. I agree that "percent of titles with at least one verified publication" sounds like a useful metric and something to add to our to-do list. Ahasuerus 16:45, 11 July 2008 (UTC)
Primary Verified (including Transient), and Verified (any way) are easy to add: e.g.
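-- Primary verifications, including Transient (reference_id 1 or 12)
select 'Primary Verified', count(*)
from pubs p, verification v
where v.pub_id = p.pub_id
and (v.reference_id = 1 or v.reference_id = 12)
UNION
-- Verifications against any reference
select 'Verified', count(*)
from pubs p, verification v
where v.pub_id = p.pub_id
UNION
-- All pubs, for the denominator
select 'Total Pubs', count(*)
from pubs p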
Gives us more encouraging percentages like:
Primary Verified	18609	15.0%
Verified		27420	22.1%
Total Pubs		123993	

BLongley 18:59, 11 July 2008 (UTC)
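One thing worth checking in the query above: count(*) over the join counts rows of verification, so if a pub can have more than one verification row (an assumption about the data, not something confirmed in this thread), that pub is counted more than once. Here is a small sketch of the difference count(DISTINCT ...) makes, using toy tables in Python's sqlite3. The reference ids 1 and 12 come from the query above; every other value is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY);
    CREATE TABLE verification (pub_id INTEGER, reference_id INTEGER);
    INSERT INTO pubs (pub_id) VALUES (1), (2), (3), (4);
    -- pub 1: reference 1 plus one other; pub 2: two others; pub 3: reference 12 only
    INSERT INTO verification (pub_id, reference_id)
    VALUES (1, 1), (1, 5), (2, 5), (2, 7), (3, 12);
    """
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Counts verification rows: a pub with two verifications contributes two.
verification_rows = scalar(
    "SELECT COUNT(*) FROM pubs p JOIN verification v ON v.pub_id = p.pub_id"
)
# Counts each verified pub once, however many verification rows it has.
verified_pubs = scalar(
    "SELECT COUNT(DISTINCT p.pub_id) FROM pubs p JOIN verification v ON v.pub_id = p.pub_id"
)
# Same idea, restricted to the two reference ids used for 'Primary Verified'.
primary_pubs = scalar(
    """
    SELECT COUNT(DISTINCT p.pub_id)
    FROM pubs p JOIN verification v ON v.pub_id = p.pub_id
    WHERE v.reference_id IN (1, 12)
    """
)
total_pubs = scalar("SELECT COUNT(*) FROM pubs")

print("verification rows:", verification_rows)
print("verified pubs:", verified_pubs, f"{100.0 * verified_pubs / total_pubs:.1f}%")
print("primary verified pubs:", primary_pubs, f"{100.0 * primary_pubs / total_pubs:.1f}%")
```

If the real table never holds more than one row per pub, the two counts agree and the original figures stand as they are.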


I think that those are indeed encouraging, and a time series on those numbers would be interesting and instructive, IMO. -DES Talk 22:26, 11 July 2008 (UTC)

Titles with Verified Pub

This is a bigger can of worms, so new section.

  • How to count titles with no pubs at all? Some are strays left behind after a pub-delete. Others are placeholders for pubs we've not yet found.
  • Variants. It's not easy to separate titles that only exist to go under the canonical author display, from those that really did get published under different titles or different pen-names.
  • Magazines. These tend to only have one publication[1] so it's a Yes/No category rather than a "we know one pub exists, how many other printings do?" question. I think these could usefully be separated out, e.g. I know I've entered a load of British stubs that must go in the unverified category, whereas a lot of current work is from primary copies. I'm not sure I want to publish the fact that I'm destroying the magazine editors' statistical reputation though... ;-)
  • Chapterbooks. Current ISFDB treatment of shortfiction published stand-alone sucks, IMO, and these are a mess I'm not going to attempt to deal with.
  • "Complete Novel" serials - is this proof of a NOVEL's existence? Or do we demand a physical book? (Cross-over with Chapterbook above.)
  • Nonfiction. Much has been added just to link REVIEWs to. We don't particularly separate useful (in ISFDB terms) Nonfiction books like Bibliographies and Biographies from plain Science Books. Or other works by the same author that may be here for completeness alone.
  • Miscounts in Omnibuses. NOVELs in Omnibuses are fairly easy; collections or anthologies in them may list only the shortfiction and not the COLLECTION or ANTHOLOGY records.

[1] ISFDB doesn't let you clone a Magazine so foreign editions are very under-represented until they deviate horribly and need separate titles.

Still, with all the disclaimers above to be noted, here's a vague stab at some possibly useful stats:

-- Titles that appear in at least one verified pub of the same type
select 'Titles with Verified Pub', t.title_ttype, count(DISTINCT t.title_id)
from titles t, pub_content pc, pubs p, verification v
where v.pub_id = p.pub_id
and   t.title_id = pc.title_id
and   pc.pub_id = p.pub_id
and   p.pub_ctype = t.title_ttype
and   p.pub_ctype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
and   t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
GROUP BY t.title_ttype
UNION
-- Titles that appear in at least one pub of the same type, verified or not
select 'Titles with Pub', t.title_ttype, count(DISTINCT t.title_id)
from titles t, pub_content pc, pubs p
where t.title_id = pc.title_id
and   pc.pub_id = p.pub_id
and   p.pub_ctype = t.title_ttype
and   p.pub_ctype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
and   t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
GROUP BY t.title_ttype
-- Sort by title type; within a type the 'Verified' row comes first
ORDER BY 2,1 desc

That gives us percentages like:

Titles with Verified Pub	ANTHOLOGY	1332	27.3%
Titles with Pub		ANTHOLOGY	4874	
Titles with Verified Pub	COLLECTION	1504	30.0%
Titles with Pub		COLLECTION	5007	
Titles with Verified Pub	NONFICTION	251	6.9%
Titles with Pub		NONFICTION	3631	
Titles with Verified Pub	NOVEL		7497	16.5%
Titles with Pub		NOVEL		45569	
Titles with Verified Pub	OMNIBUS		429	21.5%
Titles with Pub		OMNIBUS		1993	

Does anyone find any of these figures useful? BLongley 20:25, 11 July 2008 (UTC)
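For comparison, both counts and the percentage can be produced in one pass with a LEFT JOIN on verification. The sketch below is only that, a sketch: it runs against toy tables in Python's sqlite3, reuses the table and column names from the query above, and all the rows are made up. Title 3 is a NOVEL that appears only inside an OMNIBUS pub, so it drops out of the NOVEL count, which is the omnibus miscount mentioned in the list of caveats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title_ttype TEXT);
    CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_ctype TEXT);
    CREATE TABLE pub_content (pub_id INTEGER, title_id INTEGER);
    CREATE TABLE verification (pub_id INTEGER, reference_id INTEGER);
    INSERT INTO titles VALUES
        (1, 'NOVEL'), (2, 'NOVEL'), (3, 'NOVEL'), (4, 'COLLECTION'), (5, 'SHORTFICTION');
    INSERT INTO pubs VALUES
        (10, 'NOVEL'), (11, 'NOVEL'), (12, 'NOVEL'), (13, 'COLLECTION'), (14, 'OMNIBUS');
    INSERT INTO pub_content VALUES (10, 1), (11, 1), (12, 2), (13, 4), (13, 5), (14, 3);
    INSERT INTO verification VALUES (10, 1), (10, 5), (13, 1);
    """
)


def verified_share(conn):
    """Per title type: (type, titles with a verified pub, titles with a pub, percent)."""
    rows = conn.execute(
        """
        SELECT t.title_ttype,
               COUNT(DISTINCT CASE WHEN v.pub_id IS NOT NULL THEN t.title_id END),
               COUNT(DISTINCT t.title_id)
        FROM titles t
        JOIN pub_content pc ON pc.title_id = t.title_id
        JOIN pubs p ON p.pub_id = pc.pub_id AND p.pub_ctype = t.title_ttype
        LEFT JOIN verification v ON v.pub_id = p.pub_id
        WHERE t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
        GROUP BY t.title_ttype
        ORDER BY t.title_ttype
        """
    ).fetchall()
    return [(ttype, ver, tot, round(100.0 * ver / tot, 1)) for ttype, ver, tot in rows]


for ttype, verified, total, pct in verified_share(conn):
    print(ttype, verified, total, f"{pct}%")
```

The LEFT JOIN keeps titles whose pubs have no verification rows, so the denominator and numerator come from the same set of joined rows rather than from two separately written queries.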

And please, if anyone else uses plain SQL for offline ISFDB queries, check my SQL and understanding of pub contents and title types etc - point out any glaring errors or at least question them. I've no idea if I use it the way Al intends. I've added my earlier wanderings here if anyone finds those useful and can think of a better place for such. BLongley 20:49, 11 July 2008 (UTC)