ISFDB talk:Data Entropy

From ISFDB
Revision as of 17:26, 10 April 2009 by BLongley (talk | contribs) (→‎Looks all positive to me)

Additional measures

Generally positive news, but the lack of updates on the price category makes me wonder if that's of use. Some good Publisher regularization recently makes me think we need another measure, but that can't be as absolute as the other measures - we don't want "one publisher to rule them all", we just want to reach the right level of accurately recording imprint and publishers without over-duplication. Other suggestions for (definable) measures are welcome though. BLongley 21:31, 10 July 2008 (UTC)

Yep! rbh (Bob) 01:37, 11 July 2008 (UTC)
Older (<1930) publications are often difficult to find price data for, so there is probably a floor for that category, although I doubt that we have reached it yet. There is still a lot of dirty data with mangled ISBNs, prices, page counts, etc. that we need to clean up. And no, this page hasn't been forgotten, at least not by me, but there are so many other things going on and so little time to work on all of them... Ahasuerus 02:50, 11 July 2008 (UTC)
I, for one, do look at this. There are other measures that I would in theory like to see, but I'm not sure if they are worth the trouble of running and posting them. Still, if someone feels like it:
  • Prices are much harder to find for older works, and for many older works there never was a cover price in the modern sense. Therefore, it would be nice to see a version of the price query with a date limit, or even perhaps multiple versions with different date limits. For example:
    • Percent of publications with prices published from 1900 to the present
    • Percent of publications with prices published from 1920 to the present
    • Percent of publications with prices published from 1940 to the present
  • Verification is also an important measure of data entropy, IMO, and so I would like to see
    • Percent of publications verified
    • Percent of titles with at least one verified publication. While often when entering or verifying a title people add other pubs that can be inferred from data on the copyright page, and/or data from secondary sources, if at least one pub is verified, our confidence in the whole can be somewhat higher. If we had a well defined concept of "edition" then I would say "editions with at least one verified pub"
Again, I am not asking for any of these, but I think they might possibly be of some limited value.
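The date-limited price measures suggested in the list above could be sketched roughly as follows. This is only an illustration: it assumes the pubs table has a pub_year date column and a pub_price column that is empty or NULL when no price is recorded, and neither column name is confirmed anywhere in this discussion. The year() function is MySQL syntax.

```sql
-- Assumed columns (not confirmed in this thread):
--   pubs.pub_year  : publication date
--   pubs.pub_price : price text, empty or NULL when unknown
select 'Priced, 1900 to present' as measure,
       sum(case when p.pub_price is not null and p.pub_price <> '' then 1 else 0 end) as priced,
       count(*) as total,
       round(100.0 * sum(case when p.pub_price is not null and p.pub_price <> '' then 1 else 0 end) / count(*), 1) as pct
from pubs p
where year(p.pub_year) >= 1900
UNION ALL
select 'Priced, 1920 to present',
       sum(case when p.pub_price is not null and p.pub_price <> '' then 1 else 0 end),
       count(*),
       round(100.0 * sum(case when p.pub_price is not null and p.pub_price <> '' then 1 else 0 end) / count(*), 1)
from pubs p
where year(p.pub_year) >= 1920
UNION ALL
select 'Priced, 1940 to present',
       sum(case when p.pub_price is not null and p.pub_price <> '' then 1 else 0 end),
       count(*),
       round(100.0 * sum(case when p.pub_price is not null and p.pub_price <> '' then 1 else 0 end) / count(*), 1)
from pubs p
where year(p.pub_year) >= 1940
```

Any placeholder dates used for unknown or forthcoming publications would need to be filtered out as well; how those are stored is not covered here.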
On publishers, we can't really devise queries that measure how far we are from normalization until we have better defined what we mean by normal, IMO. The only query I can think of that would offer some limited value (besides the raw list of publisher names now in the system, which is too long for a wiki list) would be a bracketed list of number of publications/publisher. That is: the number of publishers with exactly one pub; number with 2-10 pubs; number with 11-20 pubs; ... number with 50-100 pubs; etc. The precise brackets are not IMO important, except that they should probably get larger as the count rises, and that "exactly one" should almost surely be a separate bracket. Does anyone else think such data would be helpful?
Anyway, those are my views. -DES Talk 15:27, 11 July 2008 (UTC)
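The bracketed publisher count described above might look something like the sketch below. It assumes pubs carries a publisher_id column linking each publication to its publisher, which is an assumption not stated in this thread, and the bracket boundaries are arbitrary choices for illustration only.

```sql
-- Assumed column (not confirmed in this thread): pubs.publisher_id
-- Inner query: one row per publisher with its pub count and bracket label.
-- Outer query: number of publishers falling in each bracket.
select bracket, count(*) as publishers
from (
  select case
           when count(*) = 1    then 'exactly 1'
           when count(*) <= 10  then '2-10'
           when count(*) <= 20  then '11-20'
           when count(*) <= 50  then '21-50'
           when count(*) <= 100 then '51-100'
           else 'over 100'
         end as bracket,
         count(*) as n
  from pubs p
  where p.publisher_id is not null
  group by p.publisher_id
) per_publisher
group by bracket
order by min(n)
```

Keeping "exactly 1" as its own bracket follows the suggestion above; a shrinking count in that bracket over time would be one rough sign that near-duplicate publisher names are being merged.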
We can check the current number/percentage of Verified Publications on the Statistics page -- currently 14.01% -- but it doesn't have historical data or secondary verifications information. I agree that "percent of titles with at least one verified publication" sounds like a useful metric and something to add to our to-do list. Ahasuerus 16:45, 11 July 2008 (UTC)
Primary Verified (including Transient), and Verified (any way) are easy to add: e.g.
select 'Primary Verified', count(*)
from pubs p, verification v
where v.pub_id = p.pub_id
and (v.reference_id = 1 or v.reference_id = 12)
UNION
select 'Verified', count(*)
from pubs p, verification v
where v.pub_id = p.pub_id
UNION
select 'Total Pubs', count(*)
from pubs p
Gives us more encouraging percentages like:
Primary Verified	18609	15.0%
Verified		27420	22.1%
Total Pubs		123993	

BLongley 18:59, 11 July 2008 (UTC)


I think that those are indeed encouraging, and a time series on those numbers would be interesting and instructive, IMO. -DES Talk 22:26, 11 July 2008 (UTC)
OK, added. BLongley 09:12, 12 July 2008 (UTC)

Titles with Verified Pub

This is a bigger can of worms, so new section.

  • How to count titles with no pubs at all? Some are strays left behind after a pub-delete. Others are placeholders for pubs we've not yet found.
  • Variants. It's not easy to separate titles that only exist to go under the canonical author display, from those that really did get published under different titles or different pen-names.
  • Magazines. These tend to only have one publication[1] so it's a Yes/No category rather than "we know one pub exists, how many other printings do?" question. I think these could usefully be separated out, e.g. I know I've entered a load of British stubs that must go in the unverified category, whereas a lot of current work is from primary copies. I'm not sure I want to publish the fact that I'm destroying the magazine editors' statistical reputation though... ;-)
  • Chapterbooks. Current ISFDB treatment of shortfiction published stand-alone sucks, IMO, and these are a mess I'm not going to attempt to deal with.
  • "Complete Novel" serials - is this proof of a NOVEL's existence? Or do we demand a physical book? (Cross-over with Chapterbook above.)
  • Nonfiction. Much has been added just to link REVIEWs to. We don't particularly separate useful (in ISFDB terms) Nonfiction books like Bibliographies and Biographies from plain Science Books. Or other works by the same author that may be here for completeness alone.
  • Miscounts in Omnibuses. NOVELs in Omnibuses are fairly easy; collections or anthologies in such may list only the shortfiction and not the COLLECTION or ANTHOLOGY records.

[1] ISFDB doesn't let you clone a Magazine so foreign editions are very under-represented until they deviate horribly and need separate titles.
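For the first point in the list above, titles with no pubs at all can at least be counted separately. A minimal sketch, using only the titles and pub_content tables and the columns that appear in the queries on this page:

```sql
-- Titles that do not appear in the contents of any publication,
-- broken down by title type.
select t.title_ttype, count(*) as titles_without_pubs
from titles t
left join pub_content pc on pc.title_id = t.title_id
where pc.title_id is null
group by t.title_ttype
order by t.title_ttype
```

This cannot tell strays left after a pub-delete from deliberate placeholders, and it will probably also pick up canonical titles whose only publications sit under variant titles, so the result is an upper bound rather than a clean count.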

Still, with all the disclaimers above to be noted, here's a vague stab at some possibly useful stats:

select 'Titles with Verified Pub', t.title_ttype, count(DISTINCT t.title_id)
from titles t, pub_content pc, pubs p, verification v
where v.pub_id = p.pub_id
and   t.title_id = pc.title_id
and   pc.pub_id = p.pub_id
and   p.pub_ctype = t.title_ttype
and   p.pub_ctype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
and   t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
GROUP BY t.title_ttype
UNION
select 'Titles with Pub', t.title_ttype, count(DISTINCT t.title_id)
from titles t, pub_content pc, pubs p
where t.title_id = pc.title_id
and   pc.pub_id = p.pub_id
and   p.pub_ctype = t.title_ttype
and   p.pub_ctype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
and   t.title_ttype IN ('ANTHOLOGY', 'COLLECTION', 'NONFICTION', 'NOVEL', 'OMNIBUS')
GROUP BY t.title_ttype
ORDER BY 2,1 desc

That gives us percentages like:

Titles with Verified Pub	ANTHOLOGY	1332	27.3%
Titles with Pub		ANTHOLOGY	4874	
Titles with Verified Pub	COLLECTION	1504	30.0%
Titles with Pub		COLLECTION	5007	
Titles with Verified Pub	NONFICTION	251	6.9%
Titles with Pub		NONFICTION	3631	
Titles with Verified Pub	NOVEL		7497	16.5%
Titles with Pub		NOVEL		45569	
Titles with Verified Pub	OMNIBUS		429	21.5%
Titles with Pub		OMNIBUS		1993	

Does anyone find any of these figures useful? BLongley 20:25, 11 July 2008 (UTC)

I find this interesting. Particularly the significantly lower percentage for novels than for anthologies or collections. Again, a time series on these numbers would be even more interesting. How hard is it to run queries on specific backup data sets from specific dates, to create such time series? -DES Talk 22:29, 11 July 2008 (UTC)
I suspect Novels are a low percentage as it's often easy to create prior stub publications from a printing history in a later verified pub. Often too easy, which is why I'll occasionally leave a long printing history in notes rather than create 10 or 20 stubs. I'll usually add first hardcover and paperback editions though, if we don't have them already. Anthologies and collections are often revised or expanded rather than just reprinted, or were "Annual" editions that go out of print far faster than a classic novel. BLongley 09:23, 12 July 2008 (UTC)
It's possible to go through the old backups to create the time series in hindsight, but it's a few minutes to load each one and I myself only keep the current one. Ahasuerus back-filled some of the existing stats though, he may still have the older backups conveniently loaded. BLongley 09:23, 12 July 2008 (UTC)
And please, if anyone else uses plain SQL for offline ISFDB queries, check my SQL and understanding of pub contents and title types etc - point out any glaring errors or at least question them. I've no idea if I use it the way Al intends. I've added my earlier wanderings here if anyone finds those useful and can think of a better place for such. BLongley 20:49, 11 July 2008 (UTC)

Still Watching?

I hope I'm not updating the stats for my benefit only, so please chime in with comments even if it's "Why do you bother with this crap?" or "Yes, I get a warm fuzzy feeling after each update, I've just never felt the need to add anything as you're doing such a good job". ;-) BLongley

My current feelings are that we're generally moving in the right direction on all measures except maybe "Publications with Prices". We may have plateaued on that - there are indeed some books that will NEVER have "official" prices - old ones (easily excluded if we put an arbitrary date on such) and Australian ones. And the forward-looking Dissembler additions will always be a bit uncertain or missing - the measure might improve if we exclude those. (If we do change the measures, we must note when they changed, or recalculate the old stats (which I'm loath to do, even if the backups are available for those dates: too much work for too little benefit)). BLongley 22:58, 12 September 2008 (UTC)

"Publications with pages" (non-magazine) is also getting distorted by Dissembler data - I believe we're importing estimated page-counts, that aren't always corrected to real ones that quickly. I know I've only corrected a few as I'm not actually buying much SF as it comes out. (What proportion of our "Selected Upcoming Books" are our editors actually buying?) Now that Dissembler additions are Noted those could be excluded from the scripts but it would be another notable change. BLongley 22:58, 12 September 2008 (UTC)

Still, we could accomplish a notable change by fixing all the "unk" or "audio" books with no pages. (2000+ pubs easily identifiable on Amazon.) Or by me just adding all the PRIOR printings referred to in pubs I own. Any sort of big project could distort the stats really. Or totally miss being recorded - e.g. if people are working on publishers we still have no stats to show if they're being improved or not. But I'm happy to add/remove/adjust Entropy stats for any good reason, or hand over the reins to someone else - when it goes a bit quiet on discussion though, "abandon" looks like another option. :-/ BLongley 22:58, 12 September 2008 (UTC)

I check the stats once or twice a month, especially the mags where we appear to be approaching an asymptote. I'm happy as long as the entropy is going down, or at least holding steady. I do appreciate the work you're putting into this.--Rkihara 07:01, 2 December 2008 (UTC)
At last! Comments! That's got to be worth another update. :-) BLongley 20:08, 2 December 2008 (UTC)
I'm actually on a project that could reduce the verification counts a bit (ensuring at least one publication of an anthology/collection has contents), so I'm glad that's still rising despite me. I ought to figure out measures for such. BLongley 20:08, 2 December 2008 (UTC)

Looks all positive to me

Good movement in all the right directions, I think. BLongley 20:49, 10 February 2009 (UTC)

And this month a HUGE change in Secondary Verifications. BLongley 19:09, 11 March 2009 (UTC)
I'm pretty sure that this great change in verifications is Bluesman's work. He's been verifying with Contento1, Locus1 and Currey. (And I've recently got back to my Tuck verifications and started verifying using Reginald1.) Sadly, I just learned today that his Currey verifications weren't showing up on the recent verifications list. I wonder if they're accounted for in the percentages in the table you created? MHHutchins 19:36, 10 April 2009 (UTC)
The figures include 4470 Currey verifications and 4271 Transient ones so I think at least that problem isn't included. BLongley 21:26, 10 April 2009 (UTC)
Well, his efforts made me look at the figures again and realise I've been miscounting. I wasn't counting verified pubs, I was counting verifications overall, so some pubs were being counted several times over. :-( BLongley 20:39, 10 April 2009 (UTC)
I'll break out the raw figures to show his good work, but the Percentages will have to start over again. BLongley 20:39, 10 April 2009 (UTC)
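The miscount described above arises because a pub verified against several sources has several rows in verification, and count(*) over the join counts every one of them. A minimal sketch of a version of the earlier query that counts each pub once, keeping the same tables and the same reference_id values (1 and 12) as that query:

```sql
-- count(DISTINCT p.pub_id) counts each verified pub once,
-- however many verification rows it has.
select 'Primary Verified', count(DISTINCT p.pub_id)
from pubs p, verification v
where v.pub_id = p.pub_id
and (v.reference_id = 1 or v.reference_id = 12)
UNION ALL
select 'Verified', count(DISTINCT p.pub_id)
from pubs p, verification v
where v.pub_id = p.pub_id
UNION ALL
select 'Total Pubs', count(*)
from pubs p
```

The 'Total Pubs' line is unchanged because it never involved the join. Whether this matches the exact query used for the restarted percentages is not known; it only shows the distinct-count idea.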