Talk:Development

From ISFDB

See Talk:Development/Archive for archived discussions

Blank author pages created by reviews

There are numerous pages like this "Dr. Clifford Wilson"[1] that appear to be empty but are created by "reviews"[2]. Is there some way of having the review displayed on the page so it doesn't appear blank? I know why they exist, but I'm sure most users don't. Thanks. Kraang 01:58, 1 June 2010 (UTC)

It's certainly possible, but it is not trivial since the software that handles the Summary Page logic is very convoluted and in places quite inefficient. After fixing some bugs in its guts earlier today, I am thinking that we may need to rewrite it almost from scratch, especially if we want to be able to continue adding more features without losing our minds. Easier said than done, of course, so for now I have created Feature Request 3009750. Ahasuerus 04:07, 1 June 2010 (UTC)
I too hate blank author pages, which is why we have "Reviewed Author" as an advanced search option now. (I don't know how people use the results: create the linked title or convert the review to an essay?) But there are authors that seem to resist deletion for other reasons I don't understand. Yet. BLongley 23:07, 1 June 2010 (UTC)

Putting a local copy at isfdb.local, not localhost

In order to work around a "feature" of Firefox that refuses cookies coming from localhost, I want to put my local isfdb copy on an address like isfdb.local -- but I can't figure out how to configure Apache (& whatever else I need to configure) to do that. Pointers? JesseW 04:12, 22 November 2010 (UTC)

And, answered. You need to put an entry in /etc/hosts -- i.e. "127.0.0.1 isfdb.local" -- and use that in the common/localdefs.py file. It may or may not be necessary to also have a "ServerName isfdb.local" line in your Apache configuration. JesseW 04:26, 22 November 2010 (UTC)
You could also use IP multihoming by choosing a different IP, since IPv4 allocates an entire class A for loopback. You could use, say, 127.127.127.127 or anything else in the 127 range. The most straightforward choice would be "127.0.0.2 isfdb.local". Uzume 05:30, 17 February 2011 (UTC)
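For reference, the configuration described above can be sketched as follows. The variable and function names below are illustrative stand-ins, not the actual contents of common/localdefs.py:

```python
# Sketch of the setup described above. The matching /etc/hosts entry is:
#     127.0.0.1 isfdb.local
# ISFDB_HOST and cgi_url are hypothetical names, not real localdefs.py code.
ISFDB_HOST = "isfdb.local"

def cgi_url(script):
    # Generate links against the configured host name so Firefox accepts
    # the cookies (it refuses them for bare "localhost")
    return "http://%s/cgi-bin/%s" % (ISFDB_HOST, script)

print(cgi_url("index.cgi"))
```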

Unicode escaping in HTML forms

I have been working on a couple of related bugs with HTML escaping lately. The first bug is fairly straightforward: in many cases, we are not escaping field values in HTML forms, so any value with embedded quotes or angle brackets will cause problems. The problem was first reported as a bug affecting publishers, but it's more pervasive than that.

The good news is that there is a simple solution to this problem: use XMLescape, which is already in effect for Title fields. The bad news is that this solution has an unfortunate side effect. Our MySQL database stores Unicode characters using XML encoding, e.g. "&#1055;" is used to represent the Cyrillic "П", and XMLescape escapes the leading ampersand character whenever Unicode is used. The result is that any value which includes Unicode characters will be overcoded -- see this Title Edit page.

I have a tentative solution that addresses both issues, but it is not very elegant and I would like to run it by other developers to make sure that I am not missing something obvious. Here is what I have done on my development server so far:

  1. Created a new function, HTMLescape. All it does is call XMLescape with a new optional parameter, "htmlencoding". This step is needed since we don't want to change the default behavior of XMLescape, which is also used in a number of other places.
  2. Modified XMLescape so that, if "htmlencoding" is passed in, the ampersand character is not escaped when it is part of the &#NNN; or &#NNNN; pattern. (What about &#xHH;?)
  3. Modified all HTML forms to use HTMLescape instead of raw values or XMLescape. It was relatively easy to do since printing is typically centralized in one place.

This approach seems to work, but it would be great if we could find an easier way to solve the problem. Ahasuerus 02:13, 4 January 2011 (UTC)
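The three steps can be sketched roughly as follows. XMLescape here is a simplified stand-in for the real function in the ISFDB library, and the placeholder trick used to protect numeric character references (NCRs) is just one possible way to implement the optional parameter:

```python
import re

# Simplified stand-ins for the real ISFDB escaping functions, illustrating
# the "htmlencoding" idea described above. Decimal NCRs like &#1055; are
# covered; hex NCRs (&#xHH;) are included speculatively, per the open
# question in the thread. Assumes field values never contain a NUL byte.
NCR_PATTERN = re.compile(r'&(#(?:\d{3,4}|x[0-9a-fA-F]{2,4});)')

def XMLescape(value, htmlencoding=False):
    if htmlencoding:
        # Temporarily replace the ampersand of each NCR with a placeholder
        # so it survives the escaping pass below
        value = NCR_PATTERN.sub('\x00\\1', value)
    value = value.replace('&', '&amp;')
    value = value.replace('<', '&lt;').replace('>', '&gt;')
    value = value.replace('"', '&quot;')
    if htmlencoding:
        # Restore the protected NCR ampersands
        value = value.replace('\x00', '&')
    return value

def HTMLescape(value):
    return XMLescape(value, htmlencoding=True)
```

With this arrangement, XMLescape keeps its default behavior for existing callers (it still overcodes "&#1055;" into "&amp;#1055;"), while HTMLescape leaves the NCR intact and still escapes quotes and angle brackets.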

My only advice is to do what seems to work for now. There are a variety of escape/unescape routines that cover strange corner cases, and at this point even I don't recall why there are so many, or which should be used in which case. Some of this comes from the HTML/XML world we live in, but the current approach is not elegant. Alvonruff 00:44, 5 January 2011 (UTC)
The solution is to serve the web pages and store data in the DB in a Unicode encoding like UTF-8. You can then still HTML-escape the ampersand, less-than, greater-than, and double-quote characters where appropriate. Storing data as XML/HTML Unicode entities for values outside Latin-1 (because the DB and the web pages are served as such) is not a good solution: the DB data will only be useful to applications that know about such encoding (mostly web browsers), and it expands Unicode characters by a large factor. I have considered entering some Japanese translations of some books by a well-known SF author, but have held off pending better Unicode support and the ability to credit translators (and audio books with voice credits, other derivative work support, etc.). Uzume 21:59, 3 February 2011 (UTC)
I think everyone agrees that a transition to UTF-8 is desirable in principle, but, based on others' reported experiences, it can be easier said than done. Personally, I don't know enough about MySQL to chance it yet. Perhaps a more experienced contributor will step up to the plate and make the necessary changes at some point. Ahasuerus 06:09, 4 February 2011 (UTC)
I might be able to work on that, but it won't happen overnight. We shall see. Right now I am still waiting to get the Python CGI code. Uzume 18:36, 4 February 2011 (UTC)
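The data side of such a migration can be illustrated with a small sketch: decimal numeric character references, as currently stored in the database, are expanded to real Unicode characters, which UTF-8 columns could then hold directly. This is an illustration only, not a migration plan:

```python
import re

# Illustration of the direction a UTF-8 migration would take for stored
# values: expand decimal numeric character references (e.g. "&#1055;")
# into the actual Unicode characters they represent
def ncr_to_unicode(value):
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), value)

print(ncr_to_unicode('&#1055;'))  # the Cyrillic capital Pe
```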

"Public" URLs

Is there a definition of which URLs are considered publicly linkable and which are not? If not, I would like to solicit feedback to create such a list. I ask because, as a code developer, one can change the URLs used to access the site. In some cases this is not likely to be an issue: changing links within the edit script hierarchy, for example, is safe so long as all the scripts are updated to use them, because such URLs are only accessed by people using the web site and nobody is expected to link directly into that URL space from other sites or applications. On the other hand, if I just went and changed the URLs for accessing pubs and titles to something else, I can imagine screams that the code introduced a "bug" because external links would be broken. Help:Linking templates and Category:Linking templates provide some hints as to what sorts of URLs are considered external/public. Here is my initial list of what I believe should not be changed without serious consideration for backwards compatibility (i.e., it is a bug to change these without discussion and some implemented transition methodology):

URL Note
/cgi-bin/index.cgi Home Page
/cgi-bin/pl.cgi?<pub_id> Pub
/cgi-bin/pl.cgi?<pub_tag> should be considered deprecated in light of Bug 3153982 "Change pub links from tags to IDs"
/cgi-bin/title.cgi?<title_id> Title
/cgi-bin/ea.cgi?<author_id> Author
/cgi-bin/ea.cgi?<author_canonical> this is questionable but I believe many things still use this
/cgi-bin/publisher.cgi?<publisher_id> Publisher
/cgi-bin/pe.cgi?<series_id> Series
/cgi-bin/seriesgrid.cgi?<series_id> Series Grid for Magazines
/cgi-bin/pubseries.cgi?<pub_series_id> Pub Series
/cgi-bin/rest/getpub.cgi?<pub_isbn> Pub API
/cgi-bin/rest/submission.cgi Submission API

And here are some that I am unsure of:

URL Note
/cgi-bin/pl.cgi?<pub_id>+<concise> apparently supported but not sure if anyone actually links to this
/cgi-bin/pl.cgi?<pub_tag>+<concise> should be considered deprecated in light of Bug 3153982 "Change pub links from tags to IDs"
/cgi-bin/eaw.cgi?<author_id> Awards
/cgi-bin/eaw.cgi?<author_canonical> apparently supported but not sure if anyone actually links to this
/cgi-bin/ae.cgi?<author_id> Alpha
/cgi-bin/ae.cgi?<author_canonical> apparently supported but not sure if anyone actually links to this
/cgi-bin/ch.cgi?<author_id> Chrono
/cgi-bin/ch.cgi?<author_canonical> apparently supported but not sure if anyone actually links to this
/cgi-bin/seriesgrid.cgi?<series_id>+<displayOrder> apparently supported but not sure if anyone actually links to this
/cgi-bin/ay.cgi?<award_ttype><award_year> Award List

I know there were some URL issues when moving from tamu.edu, but I was not around then, so I do not know how that was handled. Are there others that should be included, or that one should be concerned about? Thanks. Uzume 18:09, 4 March 2011 (UTC)

There are only four types of deep linking officially supported according to this page, which is well out of date but should be checked and updated. BLongley 18:26, 4 March 2011 (UTC)
Also, the Web API documentation does confirm getpub.cgi as supported - I suspect, but obviously can't confirm, that Fixer does use this to prevent some duplicate submissions. We might want to add some more features to the Web API - I know I experimented with the two options for "Data Thief" and found them a little lacking. BLongley 18:37, 4 March 2011 (UTC)
That's right, when Fixer builds a submission, he queries getpub.cgi to see if that ISBN is already in the live database. This is in addition to checking his local copy of the database, which can be up to 7-8 days behind the live version. Ahasuerus 19:10, 4 March 2011 (UTC)
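The duplicate check can be illustrated offline. The XML shape assumed below (a top-level `<Records>` count element) follows the Web API documentation; a real check would first fetch /cgi-bin/rest/getpub.cgi?&lt;pub_isbn&gt; and parse the response:

```python
import xml.etree.ElementTree as ET

# Offline sketch of the duplicate check described above: given the XML
# returned by getpub.cgi, decide whether the ISBN is already on file.
# The <Records> element is the record count per the Web API documentation.
def isbn_already_present(getpub_xml):
    root = ET.fromstring(getpub_xml)
    records = root.find("Records")
    return records is not None and int(records.text) > 0

sample = "<ISFDB><Records>1</Records><Publications/></ISFDB>"
print(isbn_already_present(sample))
```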
OK, this is great feedback. I moved the getpub API up to the known-unsafe list. That said, there are not that many bots, and they are all run by developers, so though they have a certain amount of "unsafeness" they are not as big an issue as true "deep links", such as ones from Wikipedia articles. Comparing the FAQ link BLongley quotes above to the list I created above, I added publisher.cgi, pubseries.cgi, and seriesgrid.cgi based on Template:Pubr, Template:PubSeries, and Template:IssueGrid. Should these be added to the FAQ, or are they a gray area for now? And of course the two API URLs have already been discussed. I am planning on killing pl.cgi?<pub_tag>, but that is a work in progress and it will have to be around a while for transition reasons (though I believe I have some outstanding code that will help reduce the number of new links people make based on pub_tag). Uzume 22:12, 4 March 2011 (UTC)

Status update

I am slowly working through the list of submitted fixes. I haven't found anything troublesome so far; half a dozen scripts have been deployed on the live server. Sorry about the slow pace; allergies make everything foggy, and I want to test everything thoroughly before deployment. Hopefully, I will test/deploy the rest of the scripts early in the week. Ahasuerus 05:33, 7 March 2011 (UTC)

I am sorry about your allergies. Uzume 14:28, 9 March 2011 (UTC)

Bug 3153982 status

The change that started the Great Migration from tags to pub IDs has been tested and installed. Next, we'll probably want to change the tags that are displayed on the "Bibliographic Comments" line to pub IDs, e.g. this pub will have "(STRWRSFTFT2011)" replaced with "(pub ID 338434)". Ahasuerus 04:36, 9 March 2011 (UTC)

Yes, I have already laid some groundwork for that hurdle, but renaming all the wiki articles from Publication:<pub_tag> to Publication:<pub_id> is going to take me a while, and I want to do that before I change the code, or existing wiki comments for pubs will be broken for a while (which I want to avoid). To minimize the impact on the wiki comments, I shall likely do a small patch for just that, which can be quickly tested and installed once I get all the wiki articles moved. That should mean few new wiki comments are made during the window between my renaming things and the code being installed (during which users could potentially make new comments in <pub_tag> space). After that gets done, I might throw all the Publication:<pub_tag> redirects into a category so that a wiki admin can delete them. Uzume 14:24, 9 March 2011 (UTC)
I also want to implement a transition mechanism in pl.cgi that does something special with incoming links that use <pub_tag>s. I haven't fully decided on, or tested, the details yet, but I am thinking along the lines of a browser redirect plus a warning message flagging the link and telling people to change the source of the inbound link. Anyway, thanks for the patch. That clears the way for the next step (ripping out some <pub_tag> functionality: removing it from search results, actually disabling the clone functionality that has now been undocumented, etc.) and will go a long way towards stemming the tide of new links using <pub_tag>. Uzume 14:16, 9 March 2011 (UTC)
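One possible shape for such a transition mechanism, purely as a sketch (lookup_pub_id_by_tag is an assumed helper, not existing ISFDB code):

```python
# Hypothetical sketch of the pl.cgi transition described above: tag-based
# links get a permanent redirect to the equivalent ID-based URL, so link
# maintainers and crawlers see the new form.
def tag_redirect(argument, lookup_pub_id_by_tag):
    if argument.isdigit():
        return None  # already an ID-based link; serve the page normally
    pub_id = lookup_pub_id_by_tag(argument)
    if pub_id is None:
        return None  # unknown tag; fall through to normal error handling
    # CGI-style redirect header block
    return ("Status: 301 Moved Permanently\r\n"
            "Location: http://www.isfdb.org/cgi-bin/pl.cgi?%d\r\n\r\n" % pub_id)
```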
Since I have undocumented using <pub_tag>s for cloning and will likely rip out that ability soon, can you assign me Bug 3050469 too? It will be rendered moot when I am done with that (and I shall try to make sure it handles any invalid values gracefully). Thanks. Uzume 14:56, 9 March 2011 (UTC)
I hope that not all functionality is going to go immediately; for instance, external sites that have linked via pub tag will want a quick way to find the ID to replace it with. And remember that Cover Image Uploads use pub tags to give a hopefully unique (but not guaranteed, see Bug 2795822) name. If anybody does try to reunite the Cover Image backups with the Database backups, a tag-to-ID conversion will still be needed. BLongley 17:05, 9 March 2011 (UTC)
Yes, I am well aware of the many implications, and that is why it won't happen overnight by any stretch of the imagination, but I can implement things that discourage and warn about such usage in preparation for a time when we can eventually do away with it. As for the images, I plan to implement uploading with <pub_id> as the default name at some point (there are some details to be worked out before any code is actually changed). Images can technically even now be uploaded with any name, and they are linked to from the pub records via a complete URL, so we have that mapping until such time as the URL is replaced, at which point it is moot anyway. I see no reason to enforce any image file naming (currently there isn't any either, but there is a default name provided by the link in the pub record). If you are uploading images with <pub_tag> naming and not submitting them as coverart updates to the pub records, then I do not have a solution for retaining such a mapping (but do we want to support that anyway?). Uzume 17:52, 9 March 2011 (UTC)

While we're waiting for Ahasuerus to rise again...

I utilised the latest backup to provide these previews of four of the next set of Data Cleanup scripts. (Part Three. The fifth only provided one result so I fixed that myself.) ISFDB:Series of Variants, ISFDB:Variants of Nonexistent Titles, ISFDB:Stray Interviews and Reviews, and the BIG one: ISFDB:Stray Authors2. The last looks as if I've coded something wrong and the list should be shorter, but at the moment I can't think why, so any feedback on that in particular would be good. BLongley 23:34, 23 April 2011 (UTC)

An interesting use of our data

This might be of interest to people. BLongley 19:43, 14 July 2011 (UTC)

Link no longer works. Gives a 404 error.--Astromath 14:58, 25 December 2012 (UTC)

Automatic notification of verified pubs.

I'm not sure how doable this is, but my suggestion is that when a verified pub is edited, all verifiers of the pub are automatically notified on their discussion pages. The only problem I foresee is that some verifiers wish to be notified on special pages they've set up for that purpose.--Astromath 15:15, 25 December 2012 (UTC)

Yes, it should be doable, although I'll have to check the Wiki software to see how easy it will be to implement. I have created FR 3598471 for this feature request.
There is also a larger issue here -- we want to maintain a complete history of all edits in the database so that we could tell that, e.g., the value of field A was changed from X to Y by editor N on 2012-12-25 (FR 2800816). This information is currently captured (but not made readily available to our users) for author edits, but not for any other types of edits. Ahasuerus 21:37, 25 December 2012 (UTC)
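A minimal sketch of what capturing per-field edit history could look like; none of these names exist in the ISFDB code base, and the record layout is purely illustrative:

```python
# Hypothetical sketch of the per-field history capture described in
# FR 2800816: compare a record before and after an edit and keep
# (field, old_value, new_value) triples for every change
def changed_fields(old_record, new_record):
    return [(field, old_record[field], new_record.get(field))
            for field in old_record
            if old_record[field] != new_record.get(field)]

history = changed_fields({"title": "A", "year": "2011"},
                         {"title": "B", "year": "2011"})
print(history)
```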

That being said, if this does come to pass, then there will definitely be a need for a preview page prior to saving the edits, in case an edit is cancelled by the editor for some reason, which would otherwise clutter up the verifiers' discussion pages.--Astromath 15:15, 25 December 2012 (UTC)

I would expect that the verifiers' Talk page will not be updated until after the submission has been approved. Ahasuerus 21:37, 25 December 2012 (UTC)

LCCN & OCLC fields

With the prevalence of LCCN and OCLC numbers, wouldn't having separate fields for them make sense? Then you wouldn't need to go through the hassle of HTML-coding the links in the Notes field.--Astromath 01:38, 1 January 2013 (UTC)

Indeed, it would be desirable. We have FR 3127708, which reads:
  • Add support for external identifiers at the Publication and Title levels. Publication identifiers can include LCCNs, Worldcat IDs, British Library Ids, Goodreads IDs, etc. Title identifiers can include "Work" identifiers used by LibraryThing, Goodreads and other social cataloging Web sites.
Ahasuerus 02:16, 1 January 2013 (UTC)
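As an illustration of what dedicated identifier fields would enable, here is a sketch using the public LCCN and WorldCat permalink forms; the dictionary and function names are hypothetical, not part of any planned implementation:

```python
# Illustrative only: with dedicated identifier fields, link generation can
# replace the hand-written HTML currently pasted into the Note field.
# The URL templates are the public LCCN and WorldCat permalink forms.
EXTERNAL_ID_URLS = {
    "LCCN": "https://lccn.loc.gov/%s",
    "OCLC": "https://www.worldcat.org/oclc/%s",
}

def external_id_link(id_type, value):
    return EXTERNAL_ID_URLS[id_type] % value

print(external_id_link("OCLC", "12345"))
```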

Nightly optimizations

I thought I would move the technical discussions from the community portal. We should be able to optimize the Unicode SQL searches used in reports 65–70 and 73–78 and in nightly/nightly_update.py 1.209 by updating common/library.py 1.139 at line 1325 with:

1325  def badUnicodePatternMatch(field_name):
1326      # Reduce the number of Unicode patterns to search for substantially by finding all "combining diacritic" combinations
1327      #   not just the ones we know how to replace
1328      # All of the keys are either a single numeric character reference (NCR) or a single character followed by single NCR
1329      # We only care about the single NCR so we remove the lead character when it is not an ampersand
1330      # And then we remove duplicate NCRs by pushing the list into a (frozen) set
1331      ncrs = frozenset(key if (key[0] == '&') else key[1:] for key in unicode_translation())
1332      patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, ncr) for ncr in ncrs)
1333      # Optimize by finding all NCRs (and throwing away everything else) first
1334      return "%s like binary '%%&#%%;%%' and ( %s )" % (field_name, patterns)
1335
1336  def suspectUnicodePatternMatch(field_name):
1337      ncrs = frozenset(['&#700;', '&#699;'])
1338      patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, ncr) for ncr in ncrs)
1339      # Optimize by finding all NCRs (and throwing away everything else) first
1340      return "%s like binary '%%&#%%;%%' and ( %s )" % (field_name, patterns)

I wonder if these SQL-generating functions are used in the application. If they are not, perhaps they should be moved into the nightly code. Now we just need to look at other things (like nightly/nightly_os_files.py 1.4). Uzume 14:43, 19 April 2017 (UTC)
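For reference, here is what these helpers emit, re-implementing suspectUnicodePatternMatch from the listing above as a standalone function (unicode_translation lives in the real library, so the simpler function with hard-coded NCRs is the one demonstrated here):

```python
# Standalone copy of suspectUnicodePatternMatch from the listing above,
# run to show the SQL fragment it generates
def suspectUnicodePatternMatch(field_name):
    ncrs = frozenset(['&#700;', '&#699;'])
    patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, ncr)
                           for ncr in ncrs)
    # The cheap "contains any NCR at all" test runs first so MySQL can
    # discard most rows before trying the per-NCR patterns
    return "%s like binary '%%&#%%;%%' and ( %s )" % (field_name, patterns)

sql = suspectUnicodePatternMatch("title_title")
print(sql)
```

Note that the order of the per-NCR patterns depends on set iteration order, which is harmless since they are joined with "or".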