Help:How to work with Records Built by Robots

From ISFDB
Revision as of 16:13, 16 July 2018 by Ahasuerus (talk | contribs) (→‎Existence: Note about possible deletion of valid ISBNs by online stores)

General Issues with Records Built by Robots

Bibliographic records built by robots present unique challenges.

Existence

The first question to ask yourself when reviewing a record built by a robot is whether the book has actually been published or is about to be published. ISFDB robots use a number of sources to find SF-related books, notably library catalogs and online bookstores like Amazon. Some of these sources list books which were announced at some point in the past but were later canceled, renamed or otherwise changed.

If a record built by a robot provides a link to the Web page that served as the source of the data, check the linked page to see if the ISBN looks like it may have been canceled. ISBN cancellation is very likely if the linked record is very sparse, e.g. there is no page count, no publisher, no cover scan, no price, etc. Note that canceled ISBNs are not deleted from Amazon's databases. Instead, Amazon changes their publication date to a date that can be 10-15 or more years in the future.
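The checks described above can be sketched as a small helper. This is only an illustration, not how ISFDB robots actually work: the SourceRecord class, its field names, and both thresholds (three missing fields, five years ahead) are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SourceRecord:
    """Hypothetical snapshot of the fields shown on the linked source page."""
    pub_date: Optional[date] = None
    pages: Optional[int] = None
    publisher: Optional[str] = None
    cover_url: Optional[str] = None
    price: Optional[str] = None


def cancellation_warnings(rec: SourceRecord, today: date,
                          max_years_ahead: int = 5) -> List[str]:
    """Return reasons to suspect that the ISBN was canceled.

    Both thresholds are arbitrary assumptions; an empty list does not
    prove that the book exists.
    """
    warnings = []
    # A very sparse source record is the main warning sign.
    missing = [name for name in ("pages", "publisher", "cover_url", "price")
               if not getattr(rec, name)]
    if len(missing) >= 3:
        warnings.append("sparse record, missing: " + ", ".join(missing))
    # Canceled ISBNs tend to be given a publication date far in the future.
    if rec.pub_date is not None and rec.pub_date.year - today.year > max_years_ahead:
        warnings.append("publication date " + rec.pub_date.isoformat()
                        + " is far in the future")
    return warnings
```

A record that triggers either warning still needs to be researched by a human before it is rejected.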

Conversely, an online store may delete a published ISBN if it has been superseded by a later ISBN.

Eligibility

The second question which frequently arises when dealing with records created by robots is whether the book is within the scope of the project and should be listed by ISFDB. See ISFDB:Policy#Contents/Project Scope Policy for the current project scope.

Note that many comic books, manga, RPG modules, etc. are listed by Amazon and other sources in a way that makes them look like SF books. More often than not, our robots are not able to tell the difference. In addition, a book labeled "horror" may be a "psychological horror" novel with no SF elements and therefore not eligible for inclusion in ISFDB (unless the author is over that hard-to-define "certain threshold" mentioned by ISFDB:Policy). Similarly, Amazon may file a non-genre book under "fantasy" when it's actually an "erotic fantasy", under "ghosts" because the title is "Ghosts of the Past", and so on.

Avoiding Data Duplication

Robots try to avoid creating submissions for ISBNs that are already present in the ISFDB database. They do so by checking whether each ISBN is present in their working copy of the ISFDB database. However, there are times when the check fails and a duplicate record is created. This typically happens when the existing ISFDB record doesn't have an ISBN on file, a common scenario with Amazon-derived ebooks because Amazon doesn't display ISBNs for ebooks.
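The following sketch illustrates the kind of ISBN check described above and why it misses records that have no ISBN on file. It is not the robots' actual code: the function names are hypothetical, and the choice to normalize everything to ISBN-13 before comparing is an assumption.

```python
from typing import Iterable, Optional, Set


def to_isbn13(raw: str) -> Optional[str]:
    """Normalize an ISBN-10 or ISBN-13 string to a bare 13-digit form.

    Returns None if the input is not 10 or 13 usable characters long.
    The check digit of the input is not validated in this sketch.
    """
    chars = raw.replace("-", "").replace(" ", "").upper()
    if len(chars) == 13 and chars.isdigit():
        return chars
    if len(chars) == 10 and chars[:9].isdigit() and (chars[9].isdigit() or chars[9] == "X"):
        core = "978" + chars[:9]
        # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def build_isbn_index(existing: Iterable[str]) -> Set[str]:
    """Collect the normalized ISBNs of existing records; blank ISBNs are skipped."""
    index = set()
    for raw in existing:
        isbn13 = to_isbn13(raw)
        if isbn13 is not None:
            index.add(isbn13)
    return index


def is_duplicate(candidate: str, index: Set[str]) -> bool:
    """True if the candidate ISBN is already in the index."""
    isbn13 = to_isbn13(candidate)
    return isbn13 is not None and isbn13 in index
```

An existing record with a blank ISBN never makes it into the index, so a new submission for the same ebook passes the check. In that case, search by title and author to find the duplicate.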

Data Quality Issues

Records built by robots can have various data quality issues. The data entry clerks used by online bookstores and even by some libraries tend to make a lot of data entry errors. This includes but is not limited to reversing first and last names, misspelling names and titles, entering irrelevant information as part of the title, and assigning incorrect or misleading subject headings to titles.

Publisher Issues

  • The publisher name used by the robot may be incorrect. This frequently happens when a US store is selling UK books or vice versa. It can also happen when an online bookstore uses the name of the distribution company instead of the name of the publisher. Sometimes a publisher goes out of business and some of its announced books are later released by another publisher.
  • The stated publisher is not disambiguated with a country-specific suffix, e.g. the record says "Tor" instead of "Tor (UK)".
  • The publisher name is missing. If the book looks like it was probably published by a traditional publisher, this is an indication that the ISBN has been canceled or delayed. On the other hand, if the book was self-published, it's possible that the author chose not to use a publisher name, so further research is advised. Amazon's Look Inside can be particularly helpful.
  • The publisher name may be -- as far as ISFDB is concerned -- the name of a Publication Series. For example, a robot may create a record for a book published by "Harlequin Nocturne", which we view as a publication series published by Harlequin.
  • The publisher name doesn't exist in ISFDB. Sometimes it indicates that the robot has found a brand new publisher. However, in most cases it means that the record contains an altered or corrupted version of another publisher name that ISFDB already has on file, e.g. "Berkley" instead of "Berkley Books". When this happens, make sure to research the publisher and correct the data.
  • Amazon's branch "CreateSpace" facilitates self-publishing. Some people who publish books via CreateSpace form a publishing company of their own and include its name in the publication while other people do not. In most cases Amazon lists CreateSpace-published books as by "CreateSpace" regardless of whether another publisher name was specified in the book. For this reason it is important to check what's stated in the book using Amazon's Look Inside. If another publisher name is specified, change the value in the Publisher field to that name. If no publisher name is specified, leave the field blank.
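When a robot-supplied publisher name is not on file, looking for near matches among existing publisher names is a quick first step. The sketch below uses Python's standard difflib module. The suggest_publishers function and the sample list of names are made up for this example, and the 0.6 cutoff is simply difflib's default rather than a tuned value.

```python
import difflib
from typing import Iterable, List


def suggest_publishers(name: str, known: Iterable[str], limit: int = 3) -> List[str]:
    """Return up to `limit` known publisher names that resemble `name`.

    The comparison is case-insensitive. An empty list means that no close
    match was found, which may or may not indicate a new publisher.
    """
    # Map lowercase forms back to the spelling that is on file.
    by_lower = {k.lower(): k for k in known}
    hits = difflib.get_close_matches(name.lower(), list(by_lower), n=limit, cutoff=0.6)
    return [by_lower[h] for h in hits]


# Hypothetical sample of names already on file.
known_publishers = ["Berkley Books", "Tor", "Tor (UK)", "Harlequin"]
print(suggest_publishers("Berkley", known_publishers))
```

A suggestion is only a starting point for research. Similar-looking names can belong to unrelated publishers, so check what is stated in the book before changing the Publisher field.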

Price Issues

The price recorded by the robot may be incorrect. When a book is originally published in one country and offered for sale in another country (e.g. US/UK), the price may be listed in a different currency. When the price field of a robotic record contains an unlikely looking number like "£4.83", it usually indicates a conversion of a foreign price. Note, however, that the opposite is not always true: it's possible for a record to have a normal-looking price even if it was published in another country. This is because booksellers often adjust prices of imported books to look more normal. Also, some publishers have a US office and a UK office, which can make distinguishing cases of "simultaneous publication" from imports challenging.
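One way to spot "unlikely looking" prices is to check the digits after the decimal point. The sketch below assumes that list prices usually end in one of a handful of common values. The set of endings is a guess made for this example, not an ISFDB rule.

```python
import re

# Assumed set of typical list-price endings; adjust to taste.
COMMON_ENDINGS = {"00", "25", "49", "50", "95", "99"}

# Currency symbol or prefix, whole units, then exactly two decimal digits.
PRICE_RE = re.compile(r"^\s*(?P<currency>[^\d\s]+)\s*(?P<whole>\d+)\.(?P<cents>\d{2})\s*$")


def looks_converted(price: str) -> bool:
    """True if the price has an unusual ending, e.g. "£4.83".

    Prices that cannot be parsed return False because this sketch
    makes no claim about them.
    """
    match = PRICE_RE.match(price)
    if match is None:
        return False
    return match.group("cents") not in COMMON_ENDINGS
```

As noted above, a normal-looking price does not prove that the book is not an import, so a False result should not be treated as confirmation.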

Page Count

The page counts listed by Amazon are based on publisher estimates produced months ahead of publication. They almost invariably differ from the actual page count of the published book. Publisher Web pages, library catalogs and OCLC frequently have more accurate page counts.

Format

Robots often derive the format information from book dimensions and other data provided by third parties. Sometimes their calculations are incorrect. Also, robots do not always have access to the same data that humans do. If a robot-generated record doesn't have a value in the format field, it's worth checking the source of the robot's information to see if a human can figure it out.

Image URL

A robot can't tell whether an image is good or bad. For example, Amazon may display:

  • a placeholder image
  • an image with "Cover Not Final" displayed in a corner
  • a blank image which says something like "Image to be unveiled prior to publication"

These types of placeholder URLs need to be manually deleted from the publication record. In addition, they may indicate that the ISBN was canceled prior to publication or at least delayed. A record with a questionable image usually requires additional research to determine whether the ISBN actually exists.

When working with images hosted by Amazon, familiarize yourself with the Amazon section of Template:PublicationFields:ImageURL, which explains how Amazon image URLs are structured.

Publication Series

Robots generally do not have access to Publication Series information, so it needs to be added manually if available.

Series

Most records built by robots do not include series information. Note, however, that some sources embed the series name in the title, usually in parentheses, e.g. "Empire of the Dragon (Event Group Thriller)". If the series name is embedded in the title field, it needs to be manually moved to the series field.
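Splitting an embedded series name off the end of a title can be sketched with a regular expression. The function name is hypothetical, and the pattern assumes that the series appears as a single trailing parenthetical, as in the example above.

```python
import re
from typing import Optional, Tuple

# Title text, then one trailing "(...)" group with no nested parentheses.
TRAILING_PARENS = re.compile(r"^(?P<title>.*\S)\s*\((?P<series>[^()]+)\)\s*$")


def split_embedded_series(raw_title: str) -> Tuple[str, Optional[str]]:
    """Split "Title (Series)" into (title, series).

    Returns (title, None) if there is no trailing parenthetical.
    """
    match = TRAILING_PARENS.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    return match.group("title"), match.group("series").strip()


print(split_embedded_series("Empire of the Dragon (Event Group Thriller)"))
```

Not every trailing parenthetical is a series name. It may be a subtitle or an edition note, so confirm the result before moving it to the series field.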

Note that there is no way to associate series information with Contents titles when submitting an anthology, collection or chapbook. You will need to wait for the submission to be approved, then create additional submissions adding series information to the newly created titles.

Cover Artist

Robotically generated records rarely include the name of the cover artist. You may want to check the original source of the data as well as Amazon's Look Inside and, optionally, other online sources to see if you can find it.

Contents Titles

Robotically generated records rarely include Contents titles. For collections, chapbooks, omnibuses, and anthologies, you will want to find Contents titles using the original source and/or other online sources, then add them to the Contents section of the submission.

If some or all of the Contents titles already exist in the ISFDB, you can skip this step, wait for the approval, then import the titles. Note that this approach is only viable if the Contents titles in question use the same form of the title and author name(s) as what's already on file. If the title and/or the name(s) are different, you will need to include the new titles in the original submission, wait for approval, then create variant titles as appropriate.