User:ErsatzCulture/SQLQueries

From ISFDB
< User:ErsatzCulture
Revision as of 09:14, 4 October 2019 by ErsatzCulture (talk | contribs) (Some SQL queries for finding non-standard author names)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Author related

Non-standard name detection

See R&S_Example_page/Initials_normalization.

Author names beginning with two upper case letters with no full-stops/periods

(91 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][A-Z]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and a full-stop/period, but then no space or hyphen

(14 records as of September 2019)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z]\\.[^ \-]' order by real_author_id, a.author_canonical;

Author names beginning with an initial and then a non-letter or full-stop/period

(269 records as of September 2019 - this pulls in a lot of false-positives e.g. HTML entities - which may or may not be relevant in this context - some non-ASCII letters, etc)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][^\\p{L}\\.\']' order by real_author_id, a.author_canonical;

Author names not beginning with an upper case ASCII letter or HTML entity

(708 records as of September 2019 - this pulls in a lot of false positives with non-ASCII upper case letters)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^A-Z&]' order by real_author_id, a.author_canonical;

Author names not beginning with a letter or HTML entity

(118 records as of September 2019 - this has a number of accented Zs that theoretically shouldn't be reported on with this regex)

   select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical;