User:ErsatzCulture/SQLQueries

From ISFDB
< User:ErsatzCulture
Revision as of 10:42, 28 April 2020 by ErsatzCulture (talk | contribs) (Query for all(ish) translations into English)
Jump to navigation Jump to search

Author related

Non-standard name detection

See R&S_Example_page/Initials_normalization.

Author names beginning with two upper case letters with no full-stops/periods

(91 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][A-Z]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and a full-stop/period, but then no space or hyphen

(14 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z]\\.[^ \-]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and then a non-letter or full-stop/period

(269 records as of September 2019 - this pulls in a lot of false-positives e.g. HTML entities - which may or may not be relevant in this context - some non-ASCII letters, etc)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][^\\p{L}\\.\']' order by real_author_id, a.author_canonical;

Author names not beginning with an upper case ASCII letter or HTML entity

(708 records as of September 2019 - this pulls in a lot of false positives with non-ASCII upper case letters)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^A-Z&]' order by real_author_id, a.author_canonical;

Author names not beginning with a letter or HTML entity

(118 records as of September 2019 - this has a number of accented Zs that theoretically shouldn't be reported on with this regex)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical;

Titles ranked by the number of publications they have

This seems like a reasonable proxy for answering the question "WHat are the biggest SF&F titles of all time?"

   select tc.title_id, title_title, c FROM (select title_id, count(1) c from pub_content group by title_id) tc LEFT OUTER JOIN titles t ON (t.title_id = tc.title_id) WHERE t.title_ttype = 'NOVEL' ORDER by c DESC limit 100;

The above example filters on NOVELs only, but you can easily tweak that.

This query does have a few issues though:

  • It doesn't take into account title parents/children, so (a) translated publications are omitted (I think) and (b) you get things like "1984" and "Nineteen Eighty-Four" appearing separately (and probably why HP and the Sorceror's/Philosopher's Stone doesn't appear, but some of it's sequels do).
  • It takes a fair amount of processing power - the query takes 44 seconds to run on my box. (This is why I haven't even bothered to address the aforementioned parent issue, as that'd make it even slower.)
  • Titles which are in the public domain will likely have bloated pub counts. It would be nice to have lists that include and exclude these, but that would involve doing adding author death year vs now() into the query, which would make it even slower.

Translations

All/most fiction translated into English

Reference

Warning: thousands of results.

   select t.title_id eng_id, t.title_title eng_title,
   CAST(t.title_copyright AS CHAR) eng_copyright,
   t.title_ttype, t.title_storylen,
   tp.title_id orig_id, tp.title_title orig_title,
   CAST(tp.title_copyright AS CHAR) orig_copyright,
   l.lang_name
   from titles t LEFT OUTER JOIN titles tp ON t.title_parent = tp.title_id
   LEFT OUTER JOIN languages l ON l.lang_id = tp.title_language
   WHERE t.title_language != tp.title_language
     AND t.title_ttype in ('COLLECTION', 'ANTHOLOGY', 'SHORTFICTION', 'NOVEL', 'OMNIBUS')
     AND t.title_language = 17
   ORDER BY t.title_copyright DESC;""")

There'll be a Python wrapper script that converts this into a long HTML page/table in my GitHub repo sooner or later.