Difference between revisions of "ISFDB:Data Entropy"

From ISFDB

Revision as of 17:39, 4 November 2008

Progress

The percentages show the share of bad or missing data and should ideally decrease steadily. The total number of magazine pages and publications should generally increase as more are published, but may dip while we actively work on duplicate removal. Other clean-up projects may cause blips: for example, adding publications that appeared only in reviews has led to a slight increase in publications lacking prices, as it is difficult to find this data for older publications from online sources alone.

Backup Date | Magazine Pages: Bad, Good, % Bad | Publications w/Pages: Bad, Good, % Bad | Publications w/Prices: Bad, Good, % Bad | Verifications: Primary, % Primary, Any Kind, % Any Kind
2006-09-28 68322 2035 97.1 21889 55116 28.4 11730 74884 13.5
2006-12-17 65277 6883 90.5 21662 57122 27.5 12111 75958 13.8
2007-02-18 64693 12240 84.1 21499 60864 26.1 12509 79135 13.6
2007-04-15 64016 15957 80.0 21062 65376 24.4 12611 82835 13.2
2007-06-12 62676 22479 73.6 20738 68877 23.1 12828 85481 13.0
2007-08-11 59852 33603 64.0 20409 74829 21.4 13023 90513 12.6
2007-11-08 53750 48353 52.6 19535 82131 19.2 13118 96196 12.0
2007-12-09 52599 52163 50.0 19502 84153 18.8 13468 97749 12.1
2007-12-29 50814 58049 46.6 19189 86626 18.1 13659 99747 12.0
2008-01-07 50108 60482 45.3 19140 87695 17.9 13852 100449 12.1
2008-01-14 49913 61442 44.8 19130 88133 17.8 13921 100771 12.1
2008-01-28 49481 62651 44.1 18845 89176 17.4 13972 101451 12.1
2008-02-04 49013 63802 43.5 18689 89658 17.0 13977 101734 12.1
2008-02-11 48548 65301 42.6 18520 90118 17.0 14020 101964 12.1
2008-03-01 47531 68888 40.8 18187 91785 16.5 14087 103187 12.0
2008-04-06 46449 73894 38.6 17337 93842 15.6 14270 104347 12.0
2008-04-14 46446 74347 38.5 17236 94745 15.4 14404 105073 12.1
2008-04-28 46321 75369 38.2 17134 96324 15.1 14577 106316 12.1
2008-05-15 45710 77627 37.1 16846 97913 14.7 14736 107396 12.1
2008-05-25 45188 79407 36.3 16752 98929 14.5 14819 108257 12.0
2008-07-09 43063 86647 33.2 15367 101577 13.1 14861 109132 12.0 18609 15.0 27420 22.1
2008-07-26 42058 89751 31.9 15092 102692 12.8 14959 109847 12.0 19109 15.3 27986 22.4
2008-08-10 41784 91044 31.5 15032 103948 12.6 15020 110920 11.9 19503 15.5 28473 22.6
2008-09-12 41008 94182 30.3 14956 105614 12.4 15331 112352 12.0 20517 16.1 29698 23.3
2008-10-01 40448 95801 29.7 14817 106769 12.2 15491 113246 12.0 21091 16.4 30408 23.6
2008-11-03 39824 100168 28.4 14663 108616 11.9 15646 114943 12.0 22643 17.3 32201 24.7
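
The "% Bad" columns in the table above appear to be the Bad count divided by Bad plus Good. The following is a minimal sketch of that arithmetic, checked against the 2008-11-03 row. The function name `pct_bad` and the dictionary keys are illustrative and not part of any ISFDB tooling, and the formula is inferred from the figures rather than stated by the page.

```python
def pct_bad(bad: int, good: int) -> float:
    """Return the share of "Bad" records as a percentage, rounded to one decimal."""
    total = bad + good
    if total == 0:
        raise ValueError("no records to compare")
    return round(100.0 * bad / total, 1)


# (Bad, Good) pairs copied from the 2008-11-03 row of the table above.
# The key names are hypothetical labels for the three column groups.
row_2008_11_03 = {
    "magazine_pages": (39824, 100168),
    "pubs_with_pages": (14663, 108616),
    "pubs_with_prices": (15646, 114943),
}

percentages = {
    name: pct_bad(bad, good) for name, (bad, good) in row_2008_11_03.items()
}
print(percentages)
```

The three results match the 28.4, 11.9 and 12.0 recorded for that backup, which supports reading the columns this way.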

Prices

Backup Date | Total pubs | Pubs without a price | % of pubs without a price | Bad price | % of pubs with a bad price | US price | New UK price | Old UK price | Canadian price | Australian price
2008-01-20 | 115043 | 14166 | 12.3% | 1027 | 0.89% | 80540 | 17356 | 1583 | 145 | 214

SQL Scripts Used

These are rough scripts and could generally all be improved. However, the results above were generated with them, so any change or improvement would invalidate the figures gathered so far, at least partially.

Magazine Pages with or without page numbers:

-- Count magazine content records with ('Good') and without ('Bad') a page number,
-- skipping EDITOR and COVERART title records. A NULL page is mapped to -999.
SELECT CASE IFNULL(pubc_page, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END, COUNT(*)
FROM pub_content pc, pubs p, titles t
WHERE p.pub_id = pc.pub_id
  AND p.pub_ctype = 'MAGAZINE'
  AND t.title_id = pc.title_id
  AND t.title_ttype NOT IN ('EDITOR', 'COVERART')
GROUP BY CASE IFNULL(pubc_page, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END

Publications with or without total page count:

-- Count publications with ('Good') and without ('Bad') a total page count,
-- excluding audio, electronic and online formats, which have no page count.
SELECT CASE IFNULL(pub_pages, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END, COUNT(*)
FROM pubs p
WHERE p.pub_ptype NOT LIKE '%audio%'    AND p.pub_ptype NOT LIKE '%cassette%'
  AND p.pub_ptype NOT LIKE 'CD%'        AND p.pub_ptype NOT LIKE 'compact disc%'
  AND p.pub_ptype NOT LIKE 'e%book%'    AND p.pub_ptype NOT LIKE 'electron%'
  AND p.pub_ptype NOT LIKE '%web%'      AND p.pub_ptype NOT LIKE '%ezine%'
  AND p.pub_ptype NOT LIKE '%internet%' AND p.pub_ptype NOT LIKE '%mp3%'
  AND p.pub_ptype NOT LIKE '%Adobe%'    AND p.pub_ptype NOT LIKE '%Mobipocket%'
  AND p.pub_ptype NOT LIKE '%PDF%'      AND p.pub_ptype NOT LIKE '%tape%'
  AND p.pub_ptype NOT LIKE '%www%'      AND p.pub_ptype NOT LIKE '%digit%'
  AND p.pub_ptype NOT LIKE '%online%'
GROUP BY CASE IFNULL(pub_pages, -999) WHEN -999 THEN 'Bad' ELSE 'Good' END

Publications with or without prices:

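-- Count publications with ('Good') and without ('Bad') a price.
-- A NULL price is mapped to the sentinel 'L' and counted as 'Bad'.
SELECT CASE IFNULL(pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END, COUNT(*)
FROM pubs p
GROUP BY CASE IFNULL(pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END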
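
The price query above relies on a sentinel: NULL prices are replaced with 'L' and then bucketed as "Bad". The sketch below runs the same expression against a small in-memory table to show which values land in which bucket. It makes several assumptions: SQLite stands in for the real database engine, the five rows are invented for illustration, and nothing here says whether the live `pubs` table actually contains empty-string or literal 'L' prices.

```python
import sqlite3

# In-memory stand-in for the pubs table; only the two columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubs (pub_id INTEGER PRIMARY KEY, pub_price TEXT)")
conn.executemany(
    "INSERT INTO pubs (pub_id, pub_price) VALUES (?, ?)",
    [
        (1, "$4.95"),  # ordinary price: Good
        (2, None),     # NULL price: becomes 'L', so Bad
        (3, ""),       # empty string is not NULL, so it is counted as Good
        (4, "L"),      # a literal 'L' collides with the sentinel, so Bad
        (5, "£0.75"),  # ordinary price: Good
    ],
)

# Same CASE/IFNULL bucketing as the price query, with single-quoted literals.
query = """
SELECT CASE IFNULL(pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END, COUNT(*)
FROM pubs p
GROUP BY CASE IFNULL(pub_price, 'L') WHEN 'L' THEN 'Bad' ELSE 'Good' END
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

On this toy data the query reports two "Bad" and three "Good" rows. If empty-string prices do exist in the real data, they would be reported as "Good", which is one way these rough scripts could understate missing prices.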

Verifications:

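-- Verification rows with reference_id 1 or 12, reported as 'Primary Verified'
SELECT 'Primary Verified', COUNT(*)
FROM pubs p, verification v
WHERE v.pub_id = p.pub_id
  AND (v.reference_id = 1 OR v.reference_id = 12)
UNION
-- All verification rows, whatever the reference
SELECT 'Verified', COUNT(*)
FROM pubs p, verification v
WHERE v.pub_id = p.pub_id
UNION
-- Denominator for the percentage columns
SELECT 'Total Pubs', COUNT(*)
FROM pubs p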
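
The verification query returns three labelled counts, and the "% Primary" and "% Any Kind" columns in the Progress table look like the first two divided by the third. The sketch below shows that step using the 2008-11-03 figures. The 'Total Pubs' value of 130589 is an assumption: it is the sum of the Bad and Good price counts for that date (15646 + 114943), not a number quoted on the page, and the function name is illustrative.

```python
# (label, count) rows in the shape the verification query returns.
rows = [
    ("Primary Verified", 22643),
    ("Verified", 32201),
    ("Total Pubs", 130589),  # assumption: 15646 + 114943 from the price columns
]


def verification_percentages(rows):
    """Express each verification count as a percentage of 'Total Pubs'."""
    counts = dict(rows)  # look up by label; UNION does not guarantee row order
    total = counts["Total Pubs"]
    return {
        label: round(100.0 * n / total, 1)
        for label, n in counts.items()
        if label != "Total Pubs"
    }


result = verification_percentages(rows)
print(result)
```

The output agrees with the 17.3 and 24.7 recorded in the table for that backup.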