User:ErsatzCulture/SQLQueries

From ISFDB
< User:ErsatzCulture
Revision as of 21:35, 19 November 2022 by ErsatzCulture (talk | contribs) (fancy version of verification stats)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Author related

Non-standard name detection

See R&S_Example_page/Initials_normalization.

Author names beginning with two upper case letters with no full-stops/periods

(91 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][A-Z]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and a full-stop/period, but then no space or hyphen

(14 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z]\\.[^ \-]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and then a non-letter or full-stop/period

(269 records as of September 2019 - this pulls in a lot of false-positives e.g. HTML entities - which may or may not be relevant in this context - some non-ASCII letters, etc)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][^\\p{L}\\.\']' order by real_author_id, a.author_canonical;

Author names not beginning with an upper case ASCII letter or HTML entity

(708 records as of September 2019 - this pulls in a lot of false positives with non-ASCII upper case letters)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^A-Z&]' order by real_author_id, a.author_canonical;

Author names not beginning with a letter or HTML entity

(118 records as of September 2019 - this has a number of accented Zs that theoretically shouldn't be reported on with this regex)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical;

Titles ranked by the number of publications they have

This seems like a reasonable proxy for answering the question "WHat are the biggest SF&F titles of all time?"

   select tc.title_id, title_title, c FROM (select title_id, count(1) c from pub_content group by title_id) tc LEFT OUTER JOIN titles t ON (t.title_id = tc.title_id) WHERE t.title_ttype = 'NOVEL' ORDER by c DESC limit 100;

The above example filters on NOVELs only, but you can easily tweak that.

This query does have a few issues though:

  • It doesn't take into account title parents/children, so (a) translated publications are omitted (I think) and (b) you get things like "1984" and "Nineteen Eighty-Four" appearing separately (and probably why HP and the Sorceror's/Philosopher's Stone doesn't appear, but some of it's sequels do).
  • It takes a fair amount of processing power - the query takes 44 seconds to run on my box. (This is why I haven't even bothered to address the aforementioned parent issue, as that'd make it even slower.)
  • Titles which are in the public domain will likely have bloated pub counts. It would be nice to have lists that include and exclude these, but that would involve doing adding author death year vs now() into the query, which would make it even slower.

Translations

All/most fiction translated into English

Reference

Warning: thousands of results.

   select t.title_id eng_id, t.title_title eng_title,
   CAST(t.title_copyright AS CHAR) eng_copyright,
   t.title_ttype, t.title_storylen,
   tp.title_id orig_id, tp.title_title orig_title,
   CAST(tp.title_copyright AS CHAR) orig_copyright,
   l.lang_name
   from titles t LEFT OUTER JOIN titles tp ON t.title_parent = tp.title_id
   LEFT OUTER JOIN languages l ON l.lang_id = tp.title_language
   WHERE t.title_language != tp.title_language
     AND t.title_ttype in ('COLLECTION', 'ANTHOLOGY', 'SHORTFICTION', 'NOVEL', 'OMNIBUS')
     AND t.title_language = 17
   ORDER BY t.title_copyright DESC;""")

There'll be a Python wrapper script that converts this into a long HTML page/table in my GitHub repo sooner or later.

Titles/pubs translated from one language to another

In this case, from Chinese (17) to English (17), additionally filtering on pub date 2020 or later

   select t.title_id, t.title_title, p.pub_id, p.pub_title, p.pub_year, p.publisher_id, publishers.publisher_name,
     p.pub_ptype, p.pub_ctype, t.title_ttype 
   from pubs p 
   LEFT OUTER JOIN pub_content pc ON pc.pub_id = p.pub_id
   LEFT OUTER JOIN titles t ON pc.title_id = t.title_id 
   LEFT OUTER JOIN titles parent_t ON parent_t.title_id = t.title_parent 
   LEFT OUTER JOIN publishers ON publishers.publisher_id = p.publisher_id 
   WHERE t.title_parent IS NOT NULL -- filters out most translations
     and parent_t.title_language = 13 -- 13 = Chinese
     AND t.title_language = 17 -- 17 = English
     AND pub_year >= '2020-01-01'
   ORDER BY t.title_id, p.pub_id, p.publisher_id, p.pub_year, p.pub_ptype;

The results are a bit messy, as you have novels, and short fiction all munged together, including anthologies or magazines that may only include a minority of translated material.

Awards

Show how often the Hugo Best Novel award went to a SF and/or fantasy novel

   select DISTINCT ta.title_id, a.award_title, a.award_id, YEAR(a.award_year), tm.tag_id, tag_name 
   from awards a LEFT OUTER JOIN title_awards ta ON ta.award_id = a.award_id 
   LEFT OUTER JOIN tag_mapping tm ON tm.title_id = ta.title_id 
   LEFT OUTER JOIN tags ON tags.tag_id = tm.tag_id 
   WHERE a.award_cat_id = 261 and a.award_level = 1 AND
    (tag_name is NULL or tag_name IN ('science fiction', 'fantasy')) 
   ORDER BY a.award_year;

There will be multiple rows for titles that are tagged both SF and fantasy e.g. Lord of Light.

NB: this will show NULL if the title is a variant (e.g. Three Body Problem), as the tags are applied to the parent title. This could probably be fixed, although for the cases I'm interested in, it's probably not a big problem

Verifications

Top 100 primary verified pubs

Barebones:

   select * from (
      select pub_id, count(1) c from primary_verifications pv group by pub_id
   ) foo ORDER BY c DESC LIMIT 100;

Top 100 primary verified titles this century

NB: doesn't take into account parent titles. Takes ~17 seconds to run on my fairly old PC

  SELECT * FROM (
    SELECT pc.title_id, t.title_title, t.title_ttype, count(1) c 
    FROM primary_verifications pv  
    NATURAL JOIN pubs p 
    NATURAL JOIN pub_content pc 
    LEFT OUTER JOIN titles t ON t.title_id = pc.title_id 
    WHERE p.pub_ctype = t.title_ttype 
      AND title_copyright >= '2000-00-00' 
    GROUP BY title_id) 
  foo 
  ORDER by c DESC LIMIT 100;


Fancy version, with ranking, authors, proper handling of parent titles

   WITH root_query AS (
     SELECT CASE WHEN t.title_parent !=0 THEN t.title_parent ELSE t.title_id END title_root_id,
       -- t.title_title, t.title_ttype,
       COUNT(1) num_verifications,
       COUNT(DISTINCT p.pub_id) num_verified_pubs,
       COUNT(DISTINCT pv.user_id) num_verifiers
     FROM primary_verifications pv
     NATURAL JOIN pubs p
     NATURAL JOIN pub_content pc
     LEFT OUTER JOIN titles t ON t.title_id = pc.title_id
     WHERE p.pub_ctype = t.title_ttype
     GROUP BY title_root_id
   )
   SELECT t.title_id, t.title_title, YEAR(t.title_copyright) year, t.title_ttype,
     GROUP_CONCAT(author_canonical SEPARATOR ' and ') authors,
     rq.num_verifications,  RANK() OVER (ORDER BY num_verifications DESC) num_verifications_rank,
     rq.num_verified_pubs, RANK() OVER (ORDER BY num_verified_pubs DESC) num_verified_pubs_rank,
     rq.num_verifiers, RANK() OVER (ORDER BY num_verifiers DESC) num_verifiers_rank
   FROM root_query rq
   INNER JOIN titles t ON rq.title_root_id = t.title_id
   LEFT OUTER JOIN canonical_author ca ON ca.title_id = t.title_id
   LEFT OUTER JOIN authors a ON a.author_id = ca.author_id
   GROUP BY t.title_id
   ORDER BY num_verifiers DESC, num_verifications DESC, num_verified_pubs DESC
   LIMIT 100;