Difference between revisions of "User:ErsatzCulture/DevelopmentNotes"


Latest revision as of 15:45, 3 November 2019

Development notes

emacs setup

My emacs setup is a long messy hackup assembled over decades, so it's not going to get shared publicly. However, for the benefit of anyone else, here's an ISFDB-specific snippet I have to follow Development#Code_Format (I use 4-space indentation for all of my personal and work projects):

   (add-hook 'python-mode-hook
             (lambda ()
               ;; Only touch buffers visiting files inside an ISFDB checkout
               (when (and (buffer-file-name)
                          (or (string-prefix-p "/var/www/cgi-bin/" (buffer-file-name))
                              (string-prefix-p "/proj/3rdparty/isfdb-code-svn/" (buffer-file-name))))
                 ;; http://www.isfdb.org/wiki/index.php/Development#Coding_style
                 (setq tab-width 8)
                 (setq python-indent 8)
                 (setq default-tab-width 8)
                 (setq python-indent-offset 8)
                 (setq python-indent-guess-indent-offset 8)
                 (message "Using ISFDB indentation standards"))))

Obviously you'd need to switch the directory paths to wherever you've checked out the code. Setting all of those variables is almost certainly overkill, but after spending too long fighting elisp to do what I want, I haven't been inclined to reduce it down to just what is needed.

Caveats:

  1. Assumes you're using python-mode
  2. Doesn't (currently) follow the Coding style rule about preserving tabs in files that are currently tab-only, but there are actually relatively few of those - see Talk:Development#Code_style:_tabs_vs_spaces

I also have these lines:

   ;; and .cgi suffix (for ISFDB)
   (add-to-list 'auto-mode-alist '("\\.cgi$" . python-mode))

Not sure how much they are needed, as (a) you probably shouldn't be editing the CGI versions, and (b) the shebang line might be enough to clue emacs in that these are Python files without needing to match the suffix.

HTTPS migration

Re. feature request 1298, brain dump notes follow.... (possibly this is just repeating stuff that is already known about?)

  • Assumption: this would be implemented by having a webserver process listen on port 443 and simply forward requests to the "real" webserver. The latter would initially keep serving on port 80, but at some point it would be switched to a port that is only reachable locally, with a redirect rule added to the front webserver to force all HTTP traffic to HTTPS.
  • (I thought I had an nginx config that did this, but if I did I can't find it now. Perhaps I'm getting mixed up with a node.js server process I wrote that does work that way.)
  • Assumption: Ultimately all site traffic would be over HTTPS. Theoretically you might want to have non-logged in pages served on both HTTP and HTTPS, but I think that's suboptimal. Plus, I thought that at some point Chrome was going to change to have big user warnings - i.e. stronger than the current "Not secure" box in the address bar - on all HTTP pages that contained forms (i.e. the search box on every ISFDB page) although I can't find a reference to exactly the thing I'm thinking of right now.
  • Assumption: that includes the wiki moving to HTTPS at the same time
  • Assumption: certs will be free ones from LetsEncrypt. These would have to be renewed every 3 (IIRC) months. This can be automated, although with the LetsEncrypt s/w I'm using (which I think might be a bit out-of-date, or perhaps that's just my usage), you have to bring down your existing webserver for a minute or so while it updates the certs.
  • All network resources (primarily images, but also CSS and JS) should have their URLs switched to either https:// or protocol-relative (e.g. "//isfdb.org/what/ever/") to avoid mixed content warnings.
  • Same goes for external links to avoid "you are navigating away from a secure page" warnings.
  • Any hotlinked images on external domains that don't support HTTPS would (or must?) be replaced by equivalents that use HTTPS.
  • A quite possibly inaccurate count of "http://" links in the repo code by directory is as follows:
   biblio: 103
   common: 59
   css: 23
   edit: 188
   mod: 133
   nightly: 65
   rest: 57
   scripts: 149
  • UPDATE: That undercounts massively because there are a bunch of constructed URLs using string formats like "<a href='http:/%s/...'>" % (HTFAKE). (Personally, I think it'd be good to make those all domain-relative URLs. I only came across these when something broke on my dev server while I was using a mix of localhost and IP address. Is there any reason why we need to specify the domain explicitly?)
There was an attempt to switch to relative URLs using a "quick and dirty" shortcut earlier this year -- see SR 162 "Use relative URLs instead of full URLs" for details. The attempt failed because we ended up with URLs like "http:/cgi-bin/recent.cgi", which were invalid. Once we migrate all links to ISFDBLink as discussed below, it should be easy to switch to relative URLs. Ahasuerus 09:23, 21 October 2019 (EDT)
  • I'm guessing that some of those (nightly, scripts) aren't an issue?
  • A crude audit of links in the database:
   MariaDB [isfdb]> select substring_index(pub_frontimage, '/', 1) domain, count(1) from pubs group by domain;
   +--------+----------+
   | domain | count(1) |
   +--------+----------+
   | NULL   |    75605 |
   |        |     9904 |
   | http:  |   389625 |
   | https: |    98361 |
   +--------+----------+
   
   MariaDB [isfdb]> select substring_index(author_image, '/', 1) domain, count(1) from authors group by domain;
   +--------+----------+
   | domain | count(1) |
   +--------+----------+
   | NULL   |   193828 |
   | http:  |     1979 |
   | https: |     2382 |
   +--------+----------+
   
   MariaDB [isfdb]> select substring_index(site_url, '/', 1) domain, count(1) from websites group by domain;
   +--------+----------+
   | domain | count(1) |
   +--------+----------+
   | http:  |       28 |
   | https: |        4 |
   +--------+----------+
   
   MariaDB [isfdb]> select substring_index(url, '/', 1) domain, count(1) from webpages group by domain;
   +---------+----------+
   | domain  | count(1) |
   +---------+----------+
   | http:   |   109480 |
   | https:  |    47537 |
   | test12  |        1 |
   | test123 |        1 |
   +---------+----------+
  • (Not sure where else URLs might be stored in the database. Change the split offset from 1 to 3 in the query to see the domains - most of the biggest sources of http:// links are sites that should support https equivalents/derivatives e.g. Amazon, Wikipedia, Goodreads. SFE3 looks to be the biggest non-HTTPS supporting site.)
  • Any edit and mod approval pages that allow URLs to be entered should have some warning and/or rejection of submitted http:// links.
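As a rough illustration of the last point above, here is a minimal sketch of how a submitted URL could be classified. Everything in it is an assumption for illustration: check_submitted_url is a made-up name, the host list is only an example based on the sites mentioned above, and none of this is existing ISFDB code. It sticks to syntax that Python 2.5 accepts.

```python
try:
    from urllib.parse import urlsplit   # Python 3
except ImportError:
    from urlparse import urlsplit       # Python 2

# Illustrative only: hosts assumed to serve the same content over HTTPS.
HTTPS_CAPABLE_HOSTS = ("amazon.com", "wikipedia.org", "goodreads.com")


def check_submitted_url(url):
    """Return a (verdict, url) pair for a URL entered on an edit form.

    verdict is one of "ok", "upgrade", "warn" or "reject".
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        return ("ok", url)
    if parts.scheme != "http":
        # Not a web link at all, e.g. junk like "test12"
        return ("reject", url)
    host = (parts.hostname or "").lower()
    for known in HTTPS_CAPABLE_HOSTS:
        if host == known or host.endswith("." + known):
            # Same URL, with only the scheme swapped
            return ("upgrade", "https://" + url[len("http://"):])
    # Plain HTTP on a host we know nothing about: let a human decide
    return ("warn", url)
```

Whether "warn" should block a submission or just flag it for the moderator is a policy question that the sketch deliberately leaves open.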
Interesting points. A few things off the top of my head:
  • The vast majority of hard-coded "http://" occurrences in the ISFDB code should be migrated to "ISFDBLink" anyway. It resides in common/library.py and is imported by almost all modules. Once the migration has been completed, it will be trivial to make common/library.py do whatever we want it to do, e.g. use a globally defined variable to determine whether to use HTTP or HTTPS.
  • Another affected table is "notes": 50K HTTP links (including 22K ISFDB links) vs. 10K HTTPS links. The "note" field in the "authors" table is a minor concern: 269 HTTP links (including 106 ISFDB links) and 99 HTTPS links. Many ISFDB links can be replaced with supported templates, which can then be configured to switch between HTTP and HTTPS on demand, similar to ISFDBLink above.
  • I agree that a free certificate source like LetsEncrypt is the best way to go. However, as I recall, some hosts charge more if you want to listen on port 443. We'd first need to check with the current host if the ISFDB plan has any restrictions on port 443.
Ahasuerus 19:55, 6 October 2019 (EDT)
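To make the "globally defined variable" idea concrete, here is a minimal sketch of a link helper that takes its scheme from one module-level setting. This is not the real ISFDBLink: build_link, PROTOCOL and HTMLHOST are hypothetical names, and the actual signature and settings in common/library.py may well differ.

```python
from xml.sax.saxutils import escape

# Hypothetical settings; flipping PROTOCOL would switch every generated link.
PROTOCOL = "https"
HTMLHOST = "www.isfdb.org/cgi-bin"


def build_link(script, argument, text):
    """Build an <a> element pointing at a CGI script on the configured host."""
    url = "%s://%s/%s?%s" % (PROTOCOL, HTMLHOST, script, argument)
    # Escape the URL for use inside a double-quoted attribute,
    # and the link text for use as element content.
    return '<a href="%s">%s</a>' % (escape(url, {'"': "&quot;"}), escape(text))
```

For example, build_link("title.cgi", "12345", "Dune") yields a link to https://www.isfdb.org/cgi-bin/title.cgi?12345. Once every hard-coded link goes through one helper like this, emitting domain-relative URLs instead would be a change in a single place.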

Python 2.5 vs Python 2.7 gotchas

ISFDB currently runs on Python 2.5.something, but it's hard to get this running on a modern (<5 years old) Linux distro, which will ship with 2.7.something (and 3.x) by default. python.org doesn't provide Linux binaries, and building from source has issues with some crypto-type libraries changing.

From non-exhaustive testing on my dev box, the code works fine with 2.7, but that's not enough to feel comfortable upgrading the production server. So, if doing development with Python 2.7, beware of the following gotchas that will break if the code is run on Python 2.5.

This is not a definitive list!

Base Python language

  • 'except Exception as err' doesn't work (the 'as' form was only added in 2.6); you need to write 'except Exception, err' instead
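If a piece of code needs to run unchanged on 2.5 and on newer interpreters, one option is to avoid both spellings and fetch the exception from sys.exc_info() instead. A minimal sketch, where parse_year is just a made-up example function:

```python
import sys


def parse_year(text):
    """Return text converted to an int, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        # Portable replacement for "except ValueError, err" (2.x only)
        # and "except ValueError as err" (2.6 and later).
        err = sys.exc_info()[1]
        sys.stderr.write("Bad year %r: %s\n" % (text, err))
        return None
```

This is more verbose than either native form, so it is probably only worth it in modules that genuinely have to run on both sides of the 2.5/2.6 divide.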

stdlib

  • collections.namedtuple is not available (was added in 2.6)
  • subprocess.check_output() is not available (it was added in 2.7; other convenience methods in that module are probably missing too). Also (in 2.5.1 at least) the CalledProcessError exception's __init__() doesn't accept a third 'output' argument, so you have to instantiate it with just the returncode and cmd args, then set the output on the instance as an attribute (.output is the name 2.7 uses), and finally raise the exception
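The subprocess workaround described above can be sketched roughly as follows. check_output_compat is a hypothetical helper name, not something in the ISFDB code, and unlike the real check_output() it takes no extra keyword arguments.

```python
import subprocess
import sys


def check_output_compat(cmd):
    """Run cmd (a list of arguments) and return whatever it wrote to stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status,
    mimicking what subprocess.check_output() does on Python 2.7.
    """
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    output = process.communicate()[0]
    returncode = process.returncode
    if returncode != 0:
        # Two-argument form only: 2.5's __init__ rejects a third argument.
        error = subprocess.CalledProcessError(returncode, cmd)
        # Attach the captured output afterwards, under the 2.7 attribute name.
        error.output = output
        raise error
    return output
```

A typical call would be check_output_compat([sys.executable, "--version"]); callers that catch CalledProcessError can then read .returncode and .output the same way on either Python version.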

Third party libraries

  • BeautifulSoup (not part of ISFDB proper, but used by my experimental tests) version 4 aka bs4 doesn't seem to exist, so you have to use the older version that's imported as "BeautifulSoup". This has a number of API differences - not only does it not support the newer .select/.select_one methods that use CSS selectors, but there are differences in how &nbsp; is handled, whether the class attribute is a string or list, etc.
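For the class attribute difference, a small normalising helper keeps test code identical under both library versions. A minimal sketch, assuming the value comes from something like tag.get('class'); class_names is a made-up name and the helper itself needs neither library installed:

```python
def class_names(value):
    """Normalise a tag's class attribute to a list of class names.

    The old BeautifulSoup returns a single string such as "pub title",
    bs4 returns a list such as ["pub", "title"], and a tag with no
    class attribute gives None.
    """
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    # Old-style string: class names are whitespace-separated
    return value.split()
```

A test can then write "title" in class_names(tag.get('class')) without caring which version of the library is installed.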