Difference between revisions of "User:ErsatzCulture/SQLQueries"
(Some SQL queries for finding non-standard author names) |
(fancy version of verification stats) |
||
(5 intermediate revisions by the same user not shown) | |||
Line 34: | Line 34: | ||
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical; | select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical; | ||
+ | |||
+ | === Titles ranked by the number of publications they have === | ||
+ | |||
+ | This seems like a reasonable proxy for answering the question "WHat are the biggest SF&F titles of all time?" | ||
+ | |||
+ | select tc.title_id, title_title, c FROM (select title_id, count(1) c from pub_content group by title_id) tc LEFT OUTER JOIN titles t ON (t.title_id = tc.title_id) WHERE t.title_ttype = 'NOVEL' ORDER by c DESC limit 100; | ||
+ | |||
+ | The above example filters on NOVELs only, but you can easily tweak that. | ||
+ | |||
+ | This query does have a few issues though: | ||
+ | |||
+ | * It doesn't take into account title parents/children, so (a) translated publications are omitted (I think) and (b) you get things like "1984" and "Nineteen Eighty-Four" appearing separately (and probably why HP and the Sorceror's/Philosopher's Stone doesn't appear, but some of it's sequels do). | ||
+ | * It takes a fair amount of processing power - the query takes 44 seconds to run on my box. (This is why I haven't even bothered to address the aforementioned parent issue, as that'd make it even slower.) | ||
+ | * Titles which are in the public domain will likely have bloated pub counts. It would be nice to have lists that include and exclude these, but that would involve doing adding author death year vs now() into the query, which would make it even slower. | ||
+ | |||
+ | = Translations = | ||
+ | |||
+ | == All/most fiction translated into English == | ||
+ | |||
+ | [https://old.reddit.com/r/printSF/comments/g9m9bi/exploring_as_many_different_countrys_print_sf_as/fougs2e/ Reference] | ||
+ | |||
+ | Warning: thousands of results. | ||
+ | |||
+ | select t.title_id eng_id, t.title_title eng_title, | ||
+ | CAST(t.title_copyright AS CHAR) eng_copyright, | ||
+ | t.title_ttype, t.title_storylen, | ||
+ | tp.title_id orig_id, tp.title_title orig_title, | ||
+ | CAST(tp.title_copyright AS CHAR) orig_copyright, | ||
+ | l.lang_name | ||
+ | from titles t LEFT OUTER JOIN titles tp ON t.title_parent = tp.title_id | ||
+ | LEFT OUTER JOIN languages l ON l.lang_id = tp.title_language | ||
+ | WHERE t.title_language != tp.title_language | ||
+ | AND t.title_ttype in ('COLLECTION', 'ANTHOLOGY', 'SHORTFICTION', 'NOVEL', 'OMNIBUS') | ||
+ | AND t.title_language = 17 | ||
+ | ORDER BY t.title_copyright DESC;""") | ||
+ | |||
+ | There'll be a Python wrapper script that converts this into a long HTML page/table in my GitHub repo sooner or later. | ||
+ | |||
+ | == Titles/pubs translated from one language to another == | ||
+ | |||
+ | In this case, from Chinese (17) to English (17), additionally filtering on pub date 2020 or later | ||
+ | |||
+ | select t.title_id, t.title_title, p.pub_id, p.pub_title, p.pub_year, p.publisher_id, publishers.publisher_name, | ||
+ | p.pub_ptype, p.pub_ctype, t.title_ttype | ||
+ | from pubs p | ||
+ | LEFT OUTER JOIN pub_content pc ON pc.pub_id = p.pub_id | ||
+ | LEFT OUTER JOIN titles t ON pc.title_id = t.title_id | ||
+ | LEFT OUTER JOIN titles parent_t ON parent_t.title_id = t.title_parent | ||
+ | LEFT OUTER JOIN publishers ON publishers.publisher_id = p.publisher_id | ||
+ | WHERE t.title_parent IS NOT NULL -- filters out most translations | ||
+ | and parent_t.title_language = 13 -- 13 = Chinese | ||
+ | AND t.title_language = 17 -- 17 = English | ||
+ | AND pub_year >= '2020-01-01' | ||
+ | ORDER BY t.title_id, p.pub_id, p.publisher_id, p.pub_year, p.pub_ptype; | ||
+ | |||
+ | The results are a bit messy, as you have novels, and short fiction all munged together, including anthologies or magazines that may only include a minority of translated material. | ||
+ | |||
+ | = Awards = | ||
+ | |||
+ | == Show how often the Hugo Best Novel award went to a SF and/or fantasy novel == | ||
+ | |||
+ | select DISTINCT ta.title_id, a.award_title, a.award_id, YEAR(a.award_year), tm.tag_id, tag_name | ||
+ | from awards a LEFT OUTER JOIN title_awards ta ON ta.award_id = a.award_id | ||
+ | LEFT OUTER JOIN tag_mapping tm ON tm.title_id = ta.title_id | ||
+ | LEFT OUTER JOIN tags ON tags.tag_id = tm.tag_id | ||
+ | WHERE a.award_cat_id = 261 and a.award_level = 1 AND | ||
+ | (tag_name is NULL or tag_name IN ('science fiction', 'fantasy')) | ||
+ | ORDER BY a.award_year; | ||
+ | |||
+ | There will be multiple rows for titles that are tagged both SF and fantasy e.g. Lord of Light. | ||
+ | |||
+ | NB: this will show NULL if the title is a variant (e.g. Three Body Problem), as the tags are applied to the parent title. This could probably be fixed, although for the cases I'm interested in, it's probably not a big problem | ||
+ | |||
+ | = Verifications = | ||
+ | |||
+ | == Top 100 primary verified pubs == | ||
+ | |||
+ | Barebones: | ||
+ | |||
+ | select * from ( | ||
+ | select pub_id, count(1) c from primary_verifications pv group by pub_id | ||
+ | ) foo ORDER BY c DESC LIMIT 100; | ||
+ | |||
+ | == Top 100 primary verified titles this century == | ||
+ | |||
+ | NB: doesn't take into account parent titles. Takes ~17 seconds to run on my fairly old PC | ||
+ | |||
+ | SELECT * FROM ( | ||
+ | SELECT pc.title_id, t.title_title, t.title_ttype, count(1) c | ||
+ | FROM primary_verifications pv | ||
+ | NATURAL JOIN pubs p | ||
+ | NATURAL JOIN pub_content pc | ||
+ | LEFT OUTER JOIN titles t ON t.title_id = pc.title_id | ||
+ | WHERE p.pub_ctype = t.title_ttype | ||
+ | AND title_copyright >= '2000-00-00' | ||
+ | GROUP BY title_id) | ||
+ | foo | ||
+ | ORDER by c DESC LIMIT 100; | ||
+ | |||
+ | |||
+ | == Fancy version, with ranking, authors, proper handling of parent titles == | ||
+ | |||
+ | WITH root_query AS ( | ||
+ | SELECT CASE WHEN t.title_parent !=0 THEN t.title_parent ELSE t.title_id END title_root_id, | ||
+ | -- t.title_title, t.title_ttype, | ||
+ | COUNT(1) num_verifications, | ||
+ | COUNT(DISTINCT p.pub_id) num_verified_pubs, | ||
+ | COUNT(DISTINCT pv.user_id) num_verifiers | ||
+ | FROM primary_verifications pv | ||
+ | NATURAL JOIN pubs p | ||
+ | NATURAL JOIN pub_content pc | ||
+ | LEFT OUTER JOIN titles t ON t.title_id = pc.title_id | ||
+ | WHERE p.pub_ctype = t.title_ttype | ||
+ | GROUP BY title_root_id | ||
+ | ) | ||
+ | SELECT t.title_id, t.title_title, YEAR(t.title_copyright) year, t.title_ttype, | ||
+ | GROUP_CONCAT(author_canonical SEPARATOR ' and ') authors, | ||
+ | rq.num_verifications, RANK() OVER (ORDER BY num_verifications DESC) num_verifications_rank, | ||
+ | rq.num_verified_pubs, RANK() OVER (ORDER BY num_verified_pubs DESC) num_verified_pubs_rank, | ||
+ | rq.num_verifiers, RANK() OVER (ORDER BY num_verifiers DESC) num_verifiers_rank | ||
+ | FROM root_query rq | ||
+ | INNER JOIN titles t ON rq.title_root_id = t.title_id | ||
+ | LEFT OUTER JOIN canonical_author ca ON ca.title_id = t.title_id | ||
+ | LEFT OUTER JOIN authors a ON a.author_id = ca.author_id | ||
+ | GROUP BY t.title_id | ||
+ | ORDER BY num_verifiers DESC, num_verifications DESC, num_verified_pubs DESC | ||
+ | LIMIT 100; | ||
+ | |||
+ | |||
+ | [[Category:Development]] |
Latest revision as of 21:35, 19 November 2022
Non-standard name detection
See R&S_Example_page/Initials_normalization.
Author names beginning with two upper case letters with no full-stops/periods
(91 records as of September 2019)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][A-Z]' order by real_author_id, a.author_canonical;
Author names beginning with an initial and a full-stop/period, but then no space or hyphen
(14 records as of September 2019)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z]\\.[^ \-]' order by real_author_id, a.author_canonical;
Author names beginning with an initial and then a non-letter or full-stop/period
(269 records as of September 2019 - this pulls in a lot of false-positives e.g. HTML entities - which may or may not be relevant in this context - some non-ASCII letters, etc)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][^\\p{L}\\.\']' order by real_author_id, a.author_canonical;
Author names not beginning with an upper case ASCII letter or HTML entity
(708 records as of September 2019 - this pulls in a lot of false positives with non-ASCII upper case letters)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^A-Z&]' order by real_author_id, a.author_canonical;
Author names not beginning with a letter or HTML entity
(118 records as of September 2019 - this has a number of accented Zs that theoretically shouldn't be reported on with this regex)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical;
Titles ranked by the number of publications they have
This seems like a reasonable proxy for answering the question "WHat are the biggest SF&F titles of all time?"
select tc.title_id, title_title, c FROM (select title_id, count(1) c from pub_content group by title_id) tc LEFT OUTER JOIN titles t ON (t.title_id = tc.title_id) WHERE t.title_ttype = 'NOVEL' ORDER by c DESC limit 100;
The above example filters on NOVELs only, but you can easily tweak that.
This query does have a few issues though:
- It doesn't take into account title parents/children, so (a) translated publications are omitted (I think) and (b) you get things like "1984" and "Nineteen Eighty-Four" appearing separately (and probably why HP and the Sorceror's/Philosopher's Stone doesn't appear, but some of it's sequels do).
- It takes a fair amount of processing power - the query takes 44 seconds to run on my box. (This is why I haven't even bothered to address the aforementioned parent issue, as that'd make it even slower.)
- Titles which are in the public domain will likely have bloated pub counts. It would be nice to have lists that include and exclude these, but that would involve doing adding author death year vs now() into the query, which would make it even slower.
Translations
All/most fiction translated into English
Warning: thousands of results.
select t.title_id eng_id, t.title_title eng_title, CAST(t.title_copyright AS CHAR) eng_copyright, t.title_ttype, t.title_storylen, tp.title_id orig_id, tp.title_title orig_title, CAST(tp.title_copyright AS CHAR) orig_copyright, l.lang_name from titles t LEFT OUTER JOIN titles tp ON t.title_parent = tp.title_id LEFT OUTER JOIN languages l ON l.lang_id = tp.title_language WHERE t.title_language != tp.title_language AND t.title_ttype in ('COLLECTION', 'ANTHOLOGY', 'SHORTFICTION', 'NOVEL', 'OMNIBUS') AND t.title_language = 17 ORDER BY t.title_copyright DESC;""")
There'll be a Python wrapper script that converts this into a long HTML page/table in my GitHub repo sooner or later.
Titles/pubs translated from one language to another
In this case, from Chinese (17) to English (17), additionally filtering on pub date 2020 or later
select t.title_id, t.title_title, p.pub_id, p.pub_title, p.pub_year, p.publisher_id, publishers.publisher_name, p.pub_ptype, p.pub_ctype, t.title_ttype from pubs p LEFT OUTER JOIN pub_content pc ON pc.pub_id = p.pub_id LEFT OUTER JOIN titles t ON pc.title_id = t.title_id LEFT OUTER JOIN titles parent_t ON parent_t.title_id = t.title_parent LEFT OUTER JOIN publishers ON publishers.publisher_id = p.publisher_id WHERE t.title_parent IS NOT NULL -- filters out most translations and parent_t.title_language = 13 -- 13 = Chinese AND t.title_language = 17 -- 17 = English AND pub_year >= '2020-01-01' ORDER BY t.title_id, p.pub_id, p.publisher_id, p.pub_year, p.pub_ptype;
The results are a bit messy, as you have novels, and short fiction all munged together, including anthologies or magazines that may only include a minority of translated material.
Awards
Show how often the Hugo Best Novel award went to a SF and/or fantasy novel
select DISTINCT ta.title_id, a.award_title, a.award_id, YEAR(a.award_year), tm.tag_id, tag_name from awards a LEFT OUTER JOIN title_awards ta ON ta.award_id = a.award_id LEFT OUTER JOIN tag_mapping tm ON tm.title_id = ta.title_id LEFT OUTER JOIN tags ON tags.tag_id = tm.tag_id WHERE a.award_cat_id = 261 and a.award_level = 1 AND (tag_name is NULL or tag_name IN ('science fiction', 'fantasy')) ORDER BY a.award_year;
There will be multiple rows for titles that are tagged both SF and fantasy e.g. Lord of Light.
NB: this will show NULL if the title is a variant (e.g. Three Body Problem), as the tags are applied to the parent title. This could probably be fixed, although for the cases I'm interested in, it's probably not a big problem
Verifications
Top 100 primary verified pubs
Barebones:
select * from ( select pub_id, count(1) c from primary_verifications pv group by pub_id ) foo ORDER BY c DESC LIMIT 100;
Top 100 primary verified titles this century
NB: doesn't take into account parent titles. Takes ~17 seconds to run on my fairly old PC
SELECT * FROM ( SELECT pc.title_id, t.title_title, t.title_ttype, count(1) c FROM primary_verifications pv NATURAL JOIN pubs p NATURAL JOIN pub_content pc LEFT OUTER JOIN titles t ON t.title_id = pc.title_id WHERE p.pub_ctype = t.title_ttype AND title_copyright >= '2000-00-00' GROUP BY title_id) foo ORDER by c DESC LIMIT 100;
Fancy version, with ranking, authors, proper handling of parent titles
WITH root_query AS ( SELECT CASE WHEN t.title_parent !=0 THEN t.title_parent ELSE t.title_id END title_root_id, -- t.title_title, t.title_ttype, COUNT(1) num_verifications, COUNT(DISTINCT p.pub_id) num_verified_pubs, COUNT(DISTINCT pv.user_id) num_verifiers FROM primary_verifications pv NATURAL JOIN pubs p NATURAL JOIN pub_content pc LEFT OUTER JOIN titles t ON t.title_id = pc.title_id WHERE p.pub_ctype = t.title_ttype GROUP BY title_root_id ) SELECT t.title_id, t.title_title, YEAR(t.title_copyright) year, t.title_ttype, GROUP_CONCAT(author_canonical SEPARATOR ' and ') authors, rq.num_verifications, RANK() OVER (ORDER BY num_verifications DESC) num_verifications_rank, rq.num_verified_pubs, RANK() OVER (ORDER BY num_verified_pubs DESC) num_verified_pubs_rank, rq.num_verifiers, RANK() OVER (ORDER BY num_verifiers DESC) num_verifiers_rank FROM root_query rq INNER JOIN titles t ON rq.title_root_id = t.title_id LEFT OUTER JOIN canonical_author ca ON ca.title_id = t.title_id LEFT OUTER JOIN authors a ON a.author_id = ca.author_id GROUP BY t.title_id ORDER BY num_verifiers DESC, num_verifications DESC, num_verified_pubs DESC LIMIT 100;