Difference between revisions of "ISFDB talk:Data Entropy"

(Verified pubs/titles)
(Clearer verification percentages)

Revision as of 14:59, 11 July 2008

Generally positive news, but the lack of updates on the price category makes me wonder if that's of use. Some good Publisher regularization recently makes me think we need another measure, but that can't be as absolute as the other measures - we don't want "one publisher to rule them all", we just want to reach the right level of accurately recording imprint and publishers without over-duplication. Other suggestions for (definable) measures are welcome though. BLongley 21:31, 10 July 2008 (UTC)

Yep! rbh (Bob) 01:37, 11 July 2008 (UTC)
Older (<1930) publications are often difficult to find price data for, so there is probably a floor for that category, although I doubt that we have reached it yet. There is still a lot of dirty data with mangled ISBNs, prices, page counts, etc. that we need to clean up. And no, this page hasn't been forgotten, at least not by me, but there are so many other things going on and so little time to work on all of them... Ahasuerus 02:50, 11 July 2008 (UTC)
I, for one, do look at this. There are other measures that I would in theory like to see, but I'm not sure if they are worth the trouble of running and posting them. Still, if someone feels like it:
  • Prices are much harder to find for older works, and for many older works there never was a cover price in the modern sense. Therefore, it would be nice to see a version of the price query with a date limit, or even perhaps multiple versions with different date limits. For example:
    • Percent of publications with prices published from 1900 to the present
    • Percent of publications with prices published from 1920 to the present
    • Percent of publications with prices published from 1940 to the present
  • Verification is also an important measure of data entropy, IMO, and so I would like to see:
    • Percent of publications verified
    • Percent of titles with at least one verified publication. When entering or verifying a title, people often add other pubs that can be inferred from data on the copyright page and/or from secondary sources; even so, if at least one pub is verified, our confidence in the whole can be somewhat higher. If we had a well-defined concept of "edition" then I would say "editions with at least one verified pub".
Again, I am not asking for any of these, but I think they might possibly be of some limited value.
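The date-limited price measure suggested above could be sketched roughly as follows. This is only an illustration against a toy in-memory SQLite table: the table name `toy_pubs` and its `year` and `price` columns are assumptions made for the example, not the actual ISFDB schema, and the rows are invented sample data.

```python
import sqlite3


def price_coverage(conn, start_year):
    """Percent of pubs dated start_year or later that have a non-empty price.

    Returns None when no pubs fall in the range.
    """
    total, priced = conn.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN price IS NOT NULL AND price <> '' THEN 1 ELSE 0 END)
        FROM toy_pubs
        WHERE year >= ?
        """,
        (start_year,),
    ).fetchone()
    if not total:
        return None
    return 100.0 * priced / total


# Hypothetical toy data; not taken from the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_pubs (pub_id INTEGER PRIMARY KEY, year INTEGER, price TEXT)"
)
conn.executemany(
    "INSERT INTO toy_pubs (year, price) VALUES (?, ?)",
    [(1905, None), (1925, ""), (1935, "$0.25"),
     (1950, "$0.35"), (1985, "$2.95"), (1990, None)],
)

for start in (1900, 1920, 1940):
    print(start, price_coverage(conn, start))
```

Running the same function with several cutoffs gives one figure per date limit, which is the "multiple versions" idea in the list above.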
On publishers, we can't really devise queries that measure how far we are from normalization until we have better defined what we mean by normal, IMO. The only query I can think of that would offer some limited value (besides the raw list of publisher names now in the system, which is too long for a wiki list) would be a bracketed list of number of publications/publisher. That is: the number of publishers with exactly one pub; number with 2-10 pubs; number with 11-20 pubs; ... number with 50-100 pubs; etc. The precise brackets are not IMO important, except that they should probably get larger as the count rises, and that "exactly one" should almost surely be a separate bracket. Does anyone else think such data would be helpful?
Anyway, those are my views. -DES Talk 15:27, 11 July 2008 (UTC)
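A bracketed pubs-per-publisher count along the lines DES describes might look something like this. It is a sketch under assumptions: the `toy_pubs` table, its `publisher` column, and the sample rows are hypothetical, and the bracket boundaries (including the non-overlapping 21-50 and 51-100) are just one possible choice, since the post says the precise brackets are not important.

```python
import sqlite3
from collections import Counter

# Illustrative brackets; "exactly one" is kept separate, and ranges widen as counts rise.
BRACKETS = [(1, 1), (2, 10), (11, 20), (21, 50), (51, 100)]


def bracket_label(n):
    """Map a publication count to its bracket label."""
    for low, high in BRACKETS:
        if low <= n <= high:
            return str(low) if low == high else f"{low}-{high}"
    return "101+"


def publisher_brackets(conn):
    """Count how many publishers fall into each pubs-per-publisher bracket."""
    rows = conn.execute(
        "SELECT publisher, COUNT(*) FROM toy_pubs GROUP BY publisher"
    ).fetchall()
    return Counter(bracket_label(count) for _, count in rows)


# Hypothetical toy data: A and B have 1 pub each, C has 3, D has 12.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_pubs (pub_id INTEGER PRIMARY KEY, publisher TEXT)")
names = ["A"] + ["B"] + ["C"] * 3 + ["D"] * 12
conn.executemany("INSERT INTO toy_pubs (publisher) VALUES (?)", [(n,) for n in names])

counts = publisher_brackets(conn)
for label in ["1", "2-10", "11-20", "21-50", "51-100", "101+"]:
    print(label, counts.get(label, 0))
```

A large "1" bracket would hint at many one-off publisher spellings, though it cannot by itself tell genuine small publishers from typos.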
We can check the current number/percentage of Verified Publications on the Statistics page (http://www.isfdb.org/cgi-bin/stats.cgi) -- currently 14.01% -- but it doesn't have historical data or secondary verifications information. I agree that "percent of titles with at least one verified publication" sounds like a useful metric and something to add to our to-do list. Ahasuerus 16:45, 11 July 2008 (UTC)
Primary Verified (including Transient), and Verified (any way) are easy to add: e.g.
select 'Primary Verified', count(*)
from pubs p, verification v
where v.pub_id = p.pub_id
and (v.reference_id = 1 or v.reference_id = 12)
UNION
select 'Verified', count(*)
from pubs p, verification v
where v.pub_id = p.pub_id
UNION
select 'Total Pubs', count(*)
from pubs p
Gives us more encouraging percentages like:
Primary Verified	18609	15.0%
Verified		27420	22.1%
Total Pubs		123993	

BLongley 18:59, 11 July 2008 (UTC)
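For the "percent of titles with at least one verified publication" metric mentioned earlier in the thread, a minimal sketch could be the following. The toy tables `toy_pubs` (with an assumed `title_id` column) and `toy_verification` are hypothetical and much simpler than the real schema. The second function also shows why a join count and a distinct-pub count can differ when one pub carries several verification rows; whether that affects the figures above depends on how verification rows are actually stored, which this thread does not say.

```python
import sqlite3


def verified_title_percent(conn):
    """Percent of titles that have at least one pub with a verification row."""
    total = conn.execute(
        "SELECT COUNT(DISTINCT title_id) FROM toy_pubs"
    ).fetchone()[0]
    verified = conn.execute(
        """
        SELECT COUNT(DISTINCT p.title_id)
        FROM toy_pubs p
        JOIN toy_verification v ON v.pub_id = p.pub_id
        """
    ).fetchone()[0]
    return 100.0 * verified / total if total else None


def verified_pub_counts(conn):
    """Return (joined verification rows, distinct verified pubs)."""
    return conn.execute(
        """
        SELECT COUNT(*), COUNT(DISTINCT p.pub_id)
        FROM toy_pubs p
        JOIN toy_verification v ON v.pub_id = p.pub_id
        """
    ).fetchone()


# Hypothetical toy data: title 10 has two pubs, titles 20 and 30 have one each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_pubs (pub_id INTEGER PRIMARY KEY, title_id INTEGER)")
conn.execute("CREATE TABLE toy_verification (pub_id INTEGER, reference_id INTEGER)")
conn.executemany("INSERT INTO toy_pubs VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 20), (4, 30)])
# Pub 1 is verified against two references, pub 3 against one.
conn.executemany("INSERT INTO toy_verification VALUES (?, ?)",
                 [(1, 1), (1, 5), (3, 12)])

print(verified_title_percent(conn))
print(verified_pub_counts(conn))
```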