ISFDB:Data Entropy

From ISFDB
Revision as of 00:42, 15 January 2008 by Ahasuerus (talk | contribs) (→‎Progress: Added 2007-08-11)
Jump to navigation Jump to search

Progress

The percentages are for bad/missing data and should ideally continually be decreasing. The total number of magazine pages and publications should generally be increasing as more are published, but may dip if we actively work on duplicate removal for a while. Other clean-up projects may cause blips: e.g. adding publications that appeared only in reviews has led to a slight increase in publications lacking prices as it's difficult to find this data for older publications from online sources alone.

Backup Date Magazine Pages Publications w/Pages Publications w/Prices
Bad Good % Bad Bad Good % Bad Bad Good % Bad
2006-09-28 68322 2035 97.1 21889 55116 28.4 11730 74884 13.5
2006-12-17 65277 6883 90.5 21662 57122 27.5 12111 75958 13.8
2007-02-18 64693 12240 84.1 21499 60864 26.1 12509 79135 13.6
2007-04-15 64016 15957 80.0 21062 65376 24.4 12611 82835 13.2
2007-06-12 62676 22479 73.6 20738 68877 23.1 12828 85481 13.0
2007-08-11 59852 33603 64.0 20409 74829 21.4 13023 90513 12.58
2007-11-08 53750 48353 52.6 19535 82131 19.2 13118 96196 12.0
2007-12-09 52599 52163 50.0 19502 84153 18.8 13468 97749 12.1
2007-12-29 50814 58049 46.6 19189 86626 18.1 13659 99747 12.0
2008-01-07 50108 60482 45.3 19140 87695 17.9 13852 100449 12.1

SQL scripts Used

These are rough scripts and could generally all be improved: however, the results above were generated using these so any changes/improvements invalidate (even if only partially) the figures gathered so far.

Magazine Pages with or without page numbers:

select case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end, count(*)
from pub_content pc, pubs p, titles t
where p.pub_id = pc.pub_id
and p.pub_ctype = 'MAGAZINE'
and t.title_id = pc.title_id
and t.title_ttype NOT IN ('EDITOR','COVERART')
group by case ifnull(pubc_page, -999) when -999 then "Bad" else "Good" end

Publications with or without total page count:

select case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end, count(*)
from pubs p
WHERE p.pub_ptype NOT LIKE '%audio%'  And p.pub_ptype NOT LIKE '%cassette%'
And p.pub_ptype NOT LIKE 'CD%'        And p.pub_ptype NOT LIKE 'compact disc%'
And p.pub_ptype NOT LIKE 'e%book%'    And p.pub_ptype NOT LIKE 'electron%'
And p.pub_ptype NOT LIKE '%web%'      And p.pub_ptype NOT LIKE '%ezine%'
And p.pub_ptype NOT LIKE '%internet%' And p.pub_ptype NOT LIKE '%mp3%'
And p.pub_ptype NOT LIKE '%Adobe%'    And p.pub_ptype NOT LIKE '%Mobipocket%'
And p.pub_ptype NOT LIKE '%PDF%'      And p.pub_ptype NOT LIKE '%tape%'
And p.pub_ptype NOT LIKE '%www%'      And p.pub_ptype NOT LIKE '%digit%'
And p.pub_ptype NOT LIKE '%online%'
group by case ifnull(pub_pages, -999) when -999 then "Bad" else "Good" end

Publications with or without prices:

select case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end, count(*)
from pubs p
group by case ifnull(pub_price, 'L') when 'L' then "Bad" else "Good" end