Difference between revisions of "ISFDB talk:Data Entropy"

From ISFDB
(possible additional queries)

Revision as of 11:27, 11 July 2008

Generally positive news, but the lack of updates on the price category makes me wonder if that's of use. Some good Publisher regularization recently makes me think we need another measure, but that can't be as absolute as the other measures - we don't want "one publisher to rule them all", we just want to reach the right level of accurately recording imprint and publishers without over-duplication. Other suggestions for (definable) measures are welcome though. BLongley 21:31, 10 July 2008 (UTC)

Yep! rbh (Bob) 01:37, 11 July 2008 (UTC)
Older (<1930) publications are often difficult to find price data for, so there is probably a floor for that category, although I doubt that we have reached it yet. There is still a lot of dirty data with mangled ISBNs, prices, page counts, etc. that we need to clean up. And no, this page hasn't been forgotten, at least not by me, but there are so many other things going on and so little time to do all of it... Ahasuerus 02:50, 11 July 2008 (UTC)
I, for one, do look at this. There are other measures that I would in theory like to see, but I'm not sure if they are worth the trouble of running and posting them. Still, if someone feels like it:
  • Prices are much harder to find for older works, and for many older works there never was a cover price in the modern sense. Therefore, it would be nice to see a version of the price query with a date limit, or perhaps even multiple versions with different date limits. For example:
    • Percent of publications with prices published from 1900 to the present
    • Percent of publications with prices published from 1920 to the present
    • Percent of publications with prices published from 1940 to the present
  • Verification is also an important measure of data entropy, IMO, and so I would like to see:
    • Percent of publications verified
    • Percent of titles with at least one verified publication. When entering or verifying a title, people often add other pubs that can be inferred from data on the copyright page and/or from secondary sources; still, if at least one pub is verified, our confidence in the whole can be somewhat higher. If we had a well-defined concept of "edition" then I would say "editions with at least one verified pub".
Again, I am not asking for any of these, but I think they might possibly be of some limited value.
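The measures listed above could be computed roughly as follows. This is only a sketch in Python over made-up records: the `Pub` fields (`title_id`, `year`, `price`, `verified`) and the sample data are assumptions for illustration, not the actual ISFDB schema or real figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pub:
    """Hypothetical publication record; field names are illustrative only."""
    title_id: int          # the title this publication belongs to
    year: int              # publication year
    price: Optional[str]   # None or "" when no price is recorded
    verified: bool         # True if the pub has been verified


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 for an empty set."""
    return 100.0 * part / whole if whole else 0.0


def price_coverage(pubs, start_year: int) -> float:
    """Percent of pubs published from start_year to the present that have a price."""
    in_range = [p for p in pubs if p.year >= start_year]
    priced = [p for p in in_range if p.price]
    return percent(len(priced), len(in_range))


def percent_pubs_verified(pubs) -> float:
    """Percent of all pubs that are verified."""
    return percent(sum(1 for p in pubs if p.verified), len(pubs))


def percent_titles_with_verified_pub(pubs) -> float:
    """Percent of titles that have at least one verified pub."""
    all_titles = {p.title_id for p in pubs}
    verified_titles = {p.title_id for p in pubs if p.verified}
    return percent(len(verified_titles), len(all_titles))


# Made-up sample data, purely for illustration.
SAMPLE = [
    Pub(1, 1895, None, False),
    Pub(1, 1925, "$2.00", True),
    Pub(2, 1935, None, False),
    Pub(3, 1950, "$0.35", True),
    Pub(3, 1975, "$1.25", False),
]

if __name__ == "__main__":
    for cutoff in (1900, 1920, 1940):
        print(f"{cutoff} to present: {price_coverage(SAMPLE, cutoff):.1f}% priced")
    print(f"pubs verified: {percent_pubs_verified(SAMPLE):.1f}%")
    print(f"titles with a verified pub: {percent_titles_with_verified_pub(SAMPLE):.1f}%")
```

Running the same function with several cutoffs gives the "multiple versions with different date limits" without needing a separate query for each.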
On publishers, we can't really devise queries that measure how far we are from normalization until we have better defined what we mean by normal, IMO. The only query I can think of that would offer some limited value (besides the raw list of publisher names now in the system, which is too long for a wiki list) would be a bracketed list of number of publications/publisher. That is: the number of publishers with exactly one pub; number with 2-10 pubs; number with 11-20 pubs; ... number with 50-100 pubs; etc. The precise brackets are not IMO important, except that they should probably get larger as the count rises, and that "exactly one" should almost surely be a separate bracket. Does anyone else think such data would be helpful?
Anyway, those are my views. -DES Talk 15:27, 11 July 2008 (UTC)
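The bracketed publisher list suggested above might look something like the sketch below. The bracket boundaries and the sample publisher names are assumptions chosen for illustration (the comment leaves the exact brackets open), and the input is simply one publisher name per publication rather than anything taken from the real database.

```python
from collections import Counter

# Hypothetical brackets as (low, high) inclusive; high=None means open-ended.
# "Exactly one" is kept as its own bracket, and the brackets widen as counts rise.
BRACKETS = [(1, 1), (2, 10), (11, 20), (21, 50), (51, 100), (101, None)]


def bracket_label(low, high):
    """Human-readable label for a bracket, e.g. '1', '2-10', '101+'."""
    if high is None:
        return f"{low}+"
    if low == high:
        return str(low)
    return f"{low}-{high}"


def publisher_histogram(publisher_names, brackets=BRACKETS):
    """Count how many publishers fall into each publications-per-publisher bracket.

    publisher_names: iterable with one publisher name per publication.
    """
    pubs_per_publisher = Counter(publisher_names)
    histogram = {bracket_label(lo, hi): 0 for lo, hi in brackets}
    for count in pubs_per_publisher.values():
        for lo, hi in brackets:
            if count >= lo and (hi is None or count <= hi):
                histogram[bracket_label(lo, hi)] += 1
                break
    return histogram


if __name__ == "__main__":
    # Made-up data: one entry per publication.
    names = ["Ace"] * 12 + ["DAW"] * 3 + ["Gnome Press", "Gnome Pr."]
    for label, n in publisher_histogram(names).items():
        print(f"{label:>6} pubs: {n} publisher(s)")
```

A large "exactly one" bracket would be a rough hint at how many spelling variants such as the two made-up "Gnome" entries are still waiting to be merged, though some single-pub publishers are of course legitimate.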