User:ErsatzCulture/Tagging

From ISFDB
Revision as of 07:09, 4 October 2019 by ErsatzCulture (talk | contribs) (Initial page)

I think I've seen a few comments on this wiki and in the bug/feature requests about how little tagging gets used, and I agree. To help push that area forward, I'm in the middle of hacking together some tooling that shows me where I might be able to improve the tag data.

Report on books in my Goodreads collection that have no or few tags in ISFDB

(This will probably be of little use or interest to anyone who doesn't track their collection/reading on Goodreads.)

I have a script that goes through a CSV export from Goodreads, tries to match up the books against a local copy of the ISFDB database, and reports on any books that have no tags, have few tags, or don't have any of what I call "core" tags (the main/top-level genres covered on ISFDB, e.g. "science fiction", "fantasy", "horror" and a couple of others).
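As a rough illustration of the three checks just described, here is a minimal sketch. It is not the actual script: the function name `tag_issues`, the "few tags" threshold of 3, and the exact contents of `CORE_TAGS` are assumptions made for the example (only three core tags are named above, so only those three are listed).

```python
# Assumed core tags: only the three genres named in the text.
# The real list has "a couple of others" that are not given here.
CORE_TAGS = {"science fiction", "fantasy", "horror"}

# Hypothetical cut-off for what counts as "few" tags.
FEW_TAGS_THRESHOLD = 3


def tag_issues(tags, core_tags=CORE_TAGS, few_threshold=FEW_TAGS_THRESHOLD):
    """Return a list of problems with one title's tags (empty list if none)."""
    # Compare tags case-insensitively and ignore stray whitespace.
    normalized = {t.strip().lower() for t in tags}
    issues = []
    if not normalized:
        issues.append("no tags")
    elif len(normalized) < few_threshold:
        issues.append("few tags")
    # Only flag a missing core tag when the title has some tags at all.
    if normalized and not (normalized & core_tags):
        issues.append("lacks a core tag")
    return issues


if __name__ == "__main__":
    print(tag_issues(["sexism", "dystopia", "misogyny", "near future"]))
    print(tag_issues([]))
```

A title with plenty of tags can still be flagged, which is the case the "lacks a core tag" check is meant to catch.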

Some example output (the real output is colour coded, so this extract is less legible than it might otherwise be):

   ./check_isfdb_tags.py -f read -f science-fiction
   Diaspora by Greg Egan has 10 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?1399 
   ERROR:root:Could not find M.G. Wheaton/Emily Eternal in ISFDB - skipping
   Lock In by John Scalzi has 3 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?1658286 
   Children of Ruin by Adrian Tchaikovsky has 5 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?2528986 
   Europe at Midnight by Dave Hutchinson has 4 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?1916529 
   Ubik by Philip K. Dick has 2 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?948 
   A Memory Called Empire by Arkady Martine has 2 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?2500310 
   Semiosis by Sue Burke has 3 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?2300478 
   Apex by Ramez Naam has 2 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?1864582 
   Starplex by Robert J. Sawyer has 0 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?5748 
   ERROR:root:Could not find Neal Asher/Gridlinked: An Agent Cormac Novel 1 in ISFDB - skipping
   The City in the Middle of the Night by Charlie Jane Anders has 2 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?2486327 
   The Left Hand of Darkness by Ursula K. Le Guin has 20 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?7662 
   ERROR:root:Could not find C.L. Moore/Doomsday Morning in ISFDB - skipping
   Shadow Captain by Alastair Reynolds has 1 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?2469371 
   The Handmaid's Tale by Margaret Atwood has 7 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?1816 
   The Handmaid's Tale lacks a core tag: sexism,dystopia,into-tv,into-movie,list NPR Top100 (2011),misogyny,near future http://www.isfdb.org/cgi-bin/title.cgi?1816 
   ERROR:root:Could not find Aliette de Bodard/The Tea Master and the Detective in ISFDB - skipping
   Embers of War by Gareth L. Powell has 2 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?2301133 
   The Fountains of Paradise by Arthur C. Clarke has 4 tags in ISFDB http://www.isfdb.org/cgi-bin/title.cgi?1903 
   ... etc ...

The current version is pretty dumb/lazy when it comes to matching up author names (e.g. "C.L. Moore" vs "C. L. Moore") or handling title discrepancies (e.g. "Gridlinked" vs "Gridlinked: An Agent Cormac Novel 1"), hence the "ERROR"s above.
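One possible way to reduce those misses is to normalize both sides before comparing. The sketch below is only an assumption about how that could be done, not what the script currently does; `normalize_author` and `normalize_title` are hypothetical helper names.

```python
import re


def normalize_author(name):
    """Normalize initials spacing and case, so "C.L. Moore" matches "C. L. Moore"."""
    # Put a space after any period that is directly followed by a letter or digit.
    spaced = re.sub(r"\.(?=\w)", ". ", name)
    # Collapse runs of whitespace and compare case-insensitively.
    return " ".join(spaced.split()).casefold()


def normalize_title(title):
    """Drop anything after the first colon, e.g. a series subtitle."""
    main = title.split(":", 1)[0]
    return " ".join(main.split()).casefold()


if __name__ == "__main__":
    print(normalize_author("C.L. Moore") == normalize_author("C. L. Moore"))
    print(normalize_title("Gridlinked: An Agent Cormac Novel 1"))
```

Cutting the title at the first colon is a blunt choice: it would also merge distinct works that share a prefix before a colon, so it is probably best used as a fallback when the exact title finds nothing.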

The idea is that I run this script and can then easily click on the ISFDB URL for any book that it thinks could do with improvement.

The code for this report is in my GitHub repos here and here, but requires a fair bit of technical knowledge to set up. If anyone wants me to run it against their own Goodreads data, just make the CSV export of your collection available and I'll happily run the script against it and give you the results back.

(There's no technical reason why something like this couldn't be built into ISFDB itself, so that you could upload the CSV to the site. However, the processing is relatively database-intensive, and I'm not sure that running that sort of high-load job on a public-facing site would be a good idea.)