User:ErsatzCulture/SQLQueries
< User:ErsatzCulture
Jump to navigation
Jump to search
Revision as of 09:14, 4 October 2019 by ErsatzCulture (talk | contribs) (Some SQL queries for finding non-standard author names)
Non-standard name detection
See R&S_Example_page/Initials_normalization.
Author names beginning with two upper case letters with no full-stops/periods
(91 records as of September 2019)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][A-Z]' order by real_author_id, a.author_canonical;
Author names beginning with an initial and a full-stop/period, but then no space or hyphen
(14 records as of September 2019)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z]\\.[^ \-]' order by real_author_id, a.author_canonical;
Author names beginning with an initial and then a non-letter or full-stop/period
(269 records as of September 2019 - this pulls in a lot of false-positives e.g. HTML entities - which may or may not be relevant in this context - some non-ASCII letters, etc)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[A-Z][^\\p{L}\\.\']' order by real_author_id, a.author_canonical;
Author names not beginning with an upper case ASCII letter or HTML entity
(708 records as of September 2019 - this pulls in a lot of false positives with non-ASCII upper case letters)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^A-Z&]' order by real_author_id, a.author_canonical;
Author names not beginning with a letter or HTML entity
(118 records as of September 2019 - this has a number of accented Zs that theoretically shouldn't be reported on with this regex)
select a.author_id, a.author_canonical author_name, a.author_legalname, p.author_id real_author_id, ra.author_canonical real_name from authors a LEFT OUTER JOIN pseudonyms p ON (a.author_id = p.pseudonym) LEFT OUTER JOIN authors ra ON (ra.author_id = p.author_id) where CAST(a.author_canonical AS BINARY) regexp '^[^\\p{L}&]' order by real_author_id, a.author_canonical;