Revision as of 16:22, 20 May 2021

Author related

Non-standard name detection

See R&S_Example_page/Initials_normalization.

Author names beginning with two upper case letters with no full-stops/periods

(91 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][A-Z]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and a full-stop/period, but then no space or hyphen

(14 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z]\\.[^ \-]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and then a non-letter or full-stop/period

(269 records as of September 2019 - this pulls in a lot of false-positives e.g. HTML entities - which may or may not be relevant in this context - some non-ASCII letters, etc)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][^\\p{L}\\.\']' order by real_author_id, a.author_canonical;

Author names not beginning with an upper case ASCII letter or HTML entity

(708 records as of September 2019 - this pulls in a lot of false positives with non-ASCII upper case letters)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^A-Z&]' order by real_author_id, a.author_canonical;

Author names not beginning with a letter or HTML entity

(118 records as of September 2019 - this has a number of accented Zs that theoretically shouldn't be reported on with this regex)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical;

Titles ranked by the number of publications they have

This seems like a reasonable proxy for answering the question "WHat are the biggest SF&F titles of all time?"

   select tc.title_id, title_title, c FROM (select title_id, count(1) c from pub_content group by title_id) tc LEFT OUTER JOIN titles t ON (t.title_id = tc.title_id) WHERE t.title_ttype = 'NOVEL' ORDER by c DESC limit 100;

The above example filters on NOVELs only, but you can easily tweak that.

This query does have a few issues though:

It doesn't take into account title parents/children, so (a) translated publications are omitted (I think) and (b) you get things like "1984" and "Nineteen Eighty-Four" appearing separately (and probably why HP and the Sorceror's/Philosopher's Stone doesn't appear, but some of it's sequels do).
It takes a fair amount of processing power - the query takes 44 seconds to run on my box. (This is why I haven't even bothered to address the aforementioned parent issue, as that'd make it even slower.)
Titles which are in the public domain will likely have bloated pub counts. It would be nice to have lists that include and exclude these, but that would involve doing adding author death year vs now() into the query, which would make it even slower.

Translations

All/most fiction translated into English

Reference

Warning: thousands of results.

   select t.title_id eng_id, t.title_title eng_title,
   CAST(t.title_copyright AS CHAR) eng_copyright,
   t.title_ttype, t.title_storylen,
   tp.title_id orig_id, tp.title_title orig_title,
   CAST(tp.title_copyright AS CHAR) orig_copyright,
   l.lang_name
   from titles t LEFT OUTER JOIN titles tp ON t.title_parent = tp.title_id
   LEFT OUTER JOIN languages l ON l.lang_id = tp.title_language
   WHERE t.title_language != tp.title_language
     AND t.title_ttype in ('COLLECTION', 'ANTHOLOGY', 'SHORTFICTION', 'NOVEL', 'OMNIBUS')
     AND t.title_language = 17
   ORDER BY t.title_copyright DESC;""")

There'll be a Python wrapper script that converts this into a long HTML page/table in my GitHub repo sooner or later.

Titles/pubs translated from one language to another

In this case, from Chinese (17) to English (17), additionally filtering on pub date 2020 or later

   select t.title_id, t.title_title, p.pub_id, p.pub_title, p.pub_year, p.publisher_id, publishers.publisher_name,
     p.pub_ptype, p.pub_ctype, t.title_ttype 
   from pubs p 
   LEFT OUTER JOIN pub_content pc ON pc.pub_id = p.pub_id
   LEFT OUTER JOIN titles t ON pc.title_id = t.title_id 
   LEFT OUTER JOIN titles parent_t ON parent_t.title_id = t.title_parent 
   LEFT OUTER JOIN publishers ON publishers.publisher_id = p.publisher_id 
   WHERE t.title_parent IS NOT NULL -- filters out most translations
     and parent_t.title_language = 13 -- 13 = Chinese
     AND t.title_language = 17 -- 17 = English
     AND pub_year >= '2020-01-01'
   ORDER BY t.title_id, p.pub_id, p.publisher_id, p.pub_year, p.pub_ptype;

The results are a bit messy, as you have novels, and short fiction all munged together, including anthologies or magazines that may only include a minority of translated material.

Awards

Show how often the Hugo Best Novel award went to a SF and/or fantasy novel

   select DISTINCT ta.title_id, a.award_title, a.award_id, YEAR(a.award_year), tm.tag_id, tag_name 
   from awards a LEFT OUTER JOIN title_awards ta ON ta.award_id = a.award_id 
   LEFT OUTER JOIN tag_mapping tm ON tm.title_id = ta.title_id 
   LEFT OUTER JOIN tags ON tags.tag_id = tm.tag_id 
   WHERE a.award_cat_id = 261 and a.award_level = 1 AND
    (tag_name is NULL or tag_name IN ('science fiction', 'fantasy')) 
   ORDER BY a.award_year;

There will be multiple rows for titles that are tagged both SF and fantasy e.g. Lord of Light.

NB: this will show NULL if the title is a variant (e.g. Three Body Problem), as the tags are applied to the parent title. This could probably be fixed, although for the cases I'm interested in, it's probably not a big problem

Difference between revisions of "User:ErsatzCulture/SQLQueries"