Difference between revisions of "ISFDB:Data Entropy"

From ISFDB
Revision as of 14:49, 28 April 2008

Progress

The percentages show the share of bad or missing data and should ideally decrease over time. The total number of magazine pages and publications should generally increase as more are published, but may dip if we actively work on duplicate removal for a while. Other clean-up projects may cause blips: for example, adding publications that appeared only in reviews has led to a slight increase in publications lacking prices, as it's difficult to find this data for older publications from online sources alone.

              Magazine Pages           Publications w/Pages     Publications w/Prices
Backup Date   Bad     Good    % Bad    Bad     Good    % Bad    Bad     Good    % Bad
2006-09-28    68322   2035    97.1     21889   55116   28.4     11730   74884   13.5
2006-12-17    65277   6883    90.5     21662   57122   27.5     12111   75958   13.8
2007-02-18    64693   12240   84.1     21499   60864   26.1     12509   79135   13.6
2007-04-15    64016   15957   80.0     21062   65376   24.4     12611   82835   13.2
2007-06-12    62676   22479   73.6     20738   68877   23.1     12828   85481   13.0
2007-08-11    59852   33603   64.0     20409   74829   21.4     13023   90513   12.6
2007-11-08    53750   48353   52.6     19535   82131   19.2     13118   96196   12.0
2007-12-09    52599   52163   50.0     19502   84153   18.8     13468   97749   12.1
2007-12-29    50814   58049   46.6     19189   86626   18.1     13659   99747   12.0
2008-01-07    50108   60482   45.3     19140   87695   17.9     13852   100449  12.1
2008-01-14    49913   61442   44.8     19130   88133   17.8     13921   100771  12.1
2008-01-28    49481   62651   44.1     18845   89176   17.4     13972   101451  12.1
2008-02-04    49013   63802   43.5     18689   89658   17.0     13977   101734  12.1
2008-02-11    48548   65301   42.6     18520   90118   17.0     14020   101964  12.1
2008-03-01    47531   68888   40.8     18187   91785   16.5     14087   103187  12.0
2008-04-06    46449   73894   38.6     17337   93842   15.6     14270   104347  12.0
2008-04-14    46446   74347   38.5     17236   94745   15.4     14404   105073  12.1
2008-04-28    46321   75369   38.2     17134   96324   15.1     14577   106316  12.1
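
The % Bad columns appear to be Bad / (Bad + Good), expressed as a percentage and rounded to one decimal place. That formula is an assumption inferred from the figures rather than something stated on this page, and the helper name pct_bad is hypothetical. A minimal Python sketch:

```python
def pct_bad(bad: int, good: int) -> float:
    """Return the share of bad/missing records as a percentage, to one decimal place."""
    total = bad + good
    if total == 0:
        raise ValueError("no records to measure")
    return round(100 * bad / total, 1)


# First row of the table (2006-09-28, magazine pages).
print(pct_bad(68322, 2035))  # 97.1
```

Feeding in the Bad and Good counts from a new backup gives the figure for the next row.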

Prices

Backup Date                   2008-01-20
Total pubs                    115043
Pubs without a price          14166
% of pubs without a price     12.3%
Bad price                     1027
% of pubs with a bad price    0.89%
US price                      80540
New UK price                  17356
Old UK price                  1583
Canadian price                145
Australian price              214

SQL Scripts Used

These are rough scripts and could all be improved. However, the results above were generated with them, so any change or improvement would invalidate the figures gathered so far, at least in part.

Magazine Pages with or without page numbers:

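-- Count magazine content records with ('Good') and without ('Bad') a page number.
-- EDITOR and COVERART title records are excluded.
SELECT CASE IFNULL(pc.pubc_page, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END AS page_status,
       COUNT(*) AS total
FROM pub_content pc
JOIN pubs p ON p.pub_id = pc.pub_id
JOIN titles t ON t.title_id = pc.title_id
WHERE p.pub_ctype = 'MAGAZINE'
  AND t.title_ttype NOT IN ('EDITOR', 'COVERART')
GROUP BY CASE IFNULL(pc.pubc_page, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END;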

Publications with or without total page count:

-- Count publications with ('Good') and without ('Bad') a total page count.
-- Audio and electronic formats are excluded.
SELECT CASE IFNULL(p.pub_pages, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END AS pages_status,
       COUNT(*) AS total
FROM pubs p
WHERE p.pub_ptype NOT LIKE '%audio%'    AND p.pub_ptype NOT LIKE '%cassette%'
  AND p.pub_ptype NOT LIKE 'CD%'        AND p.pub_ptype NOT LIKE 'compact disc%'
  AND p.pub_ptype NOT LIKE 'e%book%'    AND p.pub_ptype NOT LIKE 'electron%'
  AND p.pub_ptype NOT LIKE '%web%'      AND p.pub_ptype NOT LIKE '%ezine%'
  AND p.pub_ptype NOT LIKE '%internet%' AND p.pub_ptype NOT LIKE '%mp3%'
  AND p.pub_ptype NOT LIKE '%Adobe%'    AND p.pub_ptype NOT LIKE '%Mobipocket%'
  AND p.pub_ptype NOT LIKE '%PDF%'      AND p.pub_ptype NOT LIKE '%tape%'
  AND p.pub_ptype NOT LIKE '%www%'      AND p.pub_ptype NOT LIKE '%digit%'
  AND p.pub_ptype NOT LIKE '%online%'
GROUP BY CASE IFNULL(p.pub_pages, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END;
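
One property of the format filter above is worth noting: in SQL, a NOT LIKE test against a NULL value evaluates to NULL rather than true, so publications whose pub_ptype is NULL are left out of both the Good and the Bad counts. The sketch below illustrates this with Python's sqlite3 module on a made-up table. The sample rows are invented, only two of the patterns are used, and SQLite stands in for the real database, so treat it as an illustration of the behaviour rather than of ISFDB data.

```python
import sqlite3

# Toy stand-in for the pubs table; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubs (pub_id INTEGER, pub_ptype TEXT, pub_pages TEXT)")
conn.executemany(
    "INSERT INTO pubs VALUES (?, ?, ?)",
    [
        (1, "hc", "320"),
        (2, "tp", None),
        (3, "audio cassette", None),
        (4, "ebook", None),
        (5, None, "128"),  # no binding recorded
    ],
)

# Same style of filter as the script, cut down to two patterns.
kept = [
    row[0]
    for row in conn.execute(
        "SELECT pub_id FROM pubs "
        "WHERE pub_ptype NOT LIKE '%audio%' AND pub_ptype NOT LIKE 'e%book%' "
        "ORDER BY pub_id"
    )
]
print(kept)  # [1, 2]
```

Publication 5 is dropped even though it has a page count, because its NULL pub_ptype never satisfies the filter.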

Publications with or without prices:

-- Count publications with ('Good') and without ('Bad') a price.
SELECT CASE IFNULL(p.pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END AS price_status,
       COUNT(*) AS total
FROM pubs p
GROUP BY CASE IFNULL(p.pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END;
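
To try the Good/Bad grouping without a full database copy, the price script can be run against a small in-memory table. This is only a sketch: it uses Python's sqlite3 module as a stand-in for the real database, the rows and prices are invented, and the table holds just the columns the query needs.

```python
import sqlite3

# Toy stand-in for the pubs table; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubs (pub_id INTEGER, pub_price TEXT)")
conn.executemany(
    "INSERT INTO pubs VALUES (?, ?)",
    [(1, "$5.99"), (2, None), (3, "£0.35"), (4, None), (5, "$0.25")],
)

# Same logic as the price script: a NULL price is mapped to the sentinel 'L'
# and counted as Bad; anything else counts as Good.
query = """
SELECT CASE IFNULL(pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END,
       COUNT(*)
FROM pubs
GROUP BY CASE IFNULL(pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

Here two of the five invented publications have no price, so the Bad count is 2 and the Good count is 3.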