Talk:Development

From ISFDB

ISFDB Discussion Pages and Noticeboards
Before posting to this page, consider whether one of the other discussion pages or noticeboards might suit your needs better.
Help desk
Questions about doing a specific task, or how to correct information when the solution is not immediately obvious.
• New post • Archives
Verification requests
Help with bibliographic, image credit, and other questions which require a physical check of the work in question.
• New post • Archives
Rules and standards
Discussions about the rules and standards, as well as questions about interpretation and application of those rules.
• New post • Rules changelog • Archives
Community Portal
General discussion about anything not covered by the more specialized noticeboards to the left.
• New post • Archives
Moderator noticeboard
Get the attention of moderators regarding submission questions.
 
• New post • Archives • Cancel submission
Roadmap: For the original discussion of Roadmap 2017 see this archived section. For the current implementation status, see What's New#Roadmap 2017.



Archive Quick Links
Archives of old discussions from the Talk:Development page.


1 · 2




An interesting use of our data

This might be of interest to people. BLongley 19:43, 14 July 2011 (UTC)

Link no longer works. Gives a 404 error.--Astromath 14:58, 25 December 2012 (UTC)

Automatic notification of verified pubs.

I'm not sure how doable this is, but my suggestion is that when a verified pub is edited, all verifiers of that pub are automatically notified on their discussion pages. The only problem I foresee is that some verifiers wish to be notified on special pages they've set up for that purpose.--Astromath 15:15, 25 December 2012 (UTC)

Yes, it should be doable, although I'll have to check the Wiki software to see how easy it will be to implement. I have created FR 3598471 for this feature request.
There is also a larger issue here -- we want to maintain a complete history of all edits in the database so that we could tell that, e.g., the value of field A was changed from X to Y by editor N on 2012-12-25 (FR 2800816). This information is currently captured (but not made readily available to our users) for author edits, but not for any other types of edits. Ahasuerus 21:37, 25 December 2012 (UTC)

That being said, if this does come to pass, then there will definitely be a need for a preview page prior to saving the edits; otherwise, edits that are cancelled by the editor for some reason would clutter up the verifiers' discussion pages.--Astromath 15:15, 25 December 2012 (UTC)

I would expect that the verifiers' Talk page will not be updated until after the submission has been approved. Ahasuerus 21:37, 25 December 2012 (UTC)

LCCN & OCLC fields

With the prevalence of LCCN and OCLC numbers, wouldn't having separate fields for them make sense? Then you wouldn't need to go through the hassle of HTML-coding the links in the notes field.--Astromath 01:38, 1 January 2013 (UTC)

Indeed, it would be desirable. We have FR 3127708, which reads:
  • Add support for external identifiers at the Publication and Title levels. Publication identifiers can include LCCNs, Worldcat IDs, British Library Ids, Goodreads IDs, etc. Title identifiers can include "Work" identifiers used by LibraryThing, Goodreads and other social cataloging Web sites.
Ahasuerus 02:16, 1 January 2013 (UTC)

Nightly optimizations

I thought I would move the technical discussions from the community portal. We should be able to optimize Unicode SQL searches used in reports 65–70 and 73–78 and from nightly/nightly_update.py 1.209 by updating common/library.py 1.139 line 1325 with:

  def badUnicodePatternMatch(field_name):
          # Substantially reduce the number of Unicode patterns to search for by finding all "combining diacritic" combinations,
          #   not just the ones we know how to replace.
          # All of the keys are either a single numeric character reference (NCR) or a single character followed by a single NCR.
          # We only care about the single NCR, so we remove the lead character of keys that do not start with an NCR "&#" prefix,
          # and then we remove duplicate NCRs by pushing the list into a (frozen) set.
          ncrs = frozenset(key if key.startswith("&#") else key[1:] for key in unicode_translation())
          patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, ncr) for ncr in ncrs)
          # Optimize by finding all NCR prefixes (and throwing away everything else) first
          return "%s like binary '%%&#%%' and ( %s )" % (field_name, patterns)
  
  def suspectUnicodePatternMatch(field_name):
          ncrs = frozenset(['ʼ', 'ʻ'])
          patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, ncr) for ncr in ncrs)
          # Optimize by finding all NCR prefixes (and throwing away everything else) first
          return "%s like binary '%%&#%%' and ( %s )" % (field_name, patterns)

I wonder if these SQL generating functions are used in the application. If they are not, perhaps they should be moved into the nightly code. Now we just need to look at other things (like nightly/nightly_os_files.py 1.4). Uzume 14:43, 19 April 2017 (UTC)
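The helper above can be exercised standalone to see the SQL it generates. In this sketch, unicode_translation() is stubbed with two sample entries; the real mapping lives in common/library.py.

```python
# Standalone sketch of badUnicodePatternMatch with a stubbed translation map.
# The two sample keys below are illustrative; the real unicode_translation()
# in common/library.py returns the full combining-diacritic mapping.
def unicode_translation():
    # Keys are either a bare NCR, or a character followed by an NCR
    return {"a&#769;": "\u00e1", "&#768;": ""}

def badUnicodePatternMatch(field_name):
    # Keep only the NCR part of each key, de-duplicated via frozenset
    ncrs = frozenset(key if key.startswith("&#") else key[1:]
                     for key in unicode_translation())
    patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, ncr)
                           for ncr in ncrs)
    # Cheap "&#" prefix test first, then the expensive per-NCR patterns
    return "%s like binary '%%&#%%' and ( %s )" % (field_name, patterns)

print(badUnicodePatternMatch("title_title"))
```

The generated clause lets MySQL reject rows with no "&#" substring before evaluating the longer list of per-NCR LIKE patterns.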

They are also used in edit/cleanup_report.py, which is why they reside in common/library.py.
Re: the additional functionality, we may want to create new cleanup reports. The current "combining diacritics" reports look for string that should have been converted at input time. The new reports would look for combining diacritics which are not on the "translation" list and may need to be added to it. Ahasuerus 16:32, 19 April 2017 (UTC)
If you prefer the original behavior we can implement this for now:
  def badUnicodePatternMatch(field_name):
          patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, item) for item in frozenset(unicode_translation()))
          # Optimize by finding all NCR prefixes (and throwing away everything else) first
          return "%s like binary '%%&#%%' and ( %s )" % (field_name, patterns)
  
  def suspectUnicodePatternMatch(field_name):
          patterns = " or ".join("%s like binary '%%%s%%'" % (field_name, item) for item in frozenset(['ʼ', 'ʻ']))
          # Optimize by finding all NCR prefixes (and throwing away everything else) first
          return "%s like binary '%%&#%%' and ( %s )" % (field_name, patterns)
It will still be a significant improvement, if somewhat less of one. Uzume 17:06, 19 April 2017 (UTC)
Thanks. Your code is functionally identical to the snippet that Marty sent this morning and which I incorporated a few minutes ago. I haven't touched the "suspect" reports yet. Ahasuerus 18:20, 19 April 2017 (UTC)
Ah, and you forked the code (nightly/nightly_update.py 1.211 → 1.212 diff). The advantage of fixing it in the original place is that it also improves edit/cleanup_report.py (as you mentioned, it is used there too). Uzume 20:55, 19 April 2017 (UTC)

Nightly cleanup reports - 2017-04-19 snapshot

Here is where we stand as of 2017-04-19. Only reports that take more than 2 seconds to compile are included:

Report 1 took 9.92 seconds to compile
Report 2 took 9.28 seconds to compile
Report 3 took 8.60 seconds to compile
Report 8 took 3.59 seconds to compile
Report 14 took 9.45 seconds to compile
Report 16 took 16.08 seconds to compile
Report 20 took 3.40 seconds to compile
Report 32 took 4.41 seconds to compile
Report 33 took 22.71 seconds to compile
Report 34 took 10.95 seconds to compile
Report 38 took 21.75 seconds to compile
Report 40 took 2.04 seconds to compile
Report 42 took 3.10 seconds to compile
Report 45 took 3.15 seconds to compile
Report 47 took 26.83 seconds to compile
Report 48 took 3.23 seconds to compile
Report 52 took 34.65 seconds to compile
Report 54 took 8.52 seconds to compile
Report 63 took 3.37 seconds to compile
Report 79 took 3.21 seconds to compile
Report 80 took 20.00 seconds to compile
Report 87 took 5.16 seconds to compile
Report 88 took 3.67 seconds to compile
Report 92 took 3.18 seconds to compile
Report 93 took 11.83 seconds to compile
Report 107 took 4.76 seconds to compile
Report 111 took 14.72 seconds to compile
Report 127 took 6.41 seconds to compile
Report 137 took 7.33 seconds to compile
Report 161 took 5.19 seconds to compile
Report 167 took 29.25 seconds to compile
Report 168 took 3.11 seconds to compile
Report 191 took 4.77 seconds to compile
Report 193 took 23.55 seconds to compile
Report 196 took 3.44 seconds to compile
Report 197 took 3.45 seconds to compile
Report 200 took 3.90 seconds to compile
Report 204 took 6.59 seconds to compile

Ahasuerus 18:24, 19 April 2017 (UTC)

97 and 99 are gone after adding USE INDEX. Ahasuerus 16:47, 20 April 2017 (UTC)
Do we have times on the complete nightly run (all the reports and os files, etc.)? Uzume 20:36, 20 April 2017 (UTC)
Not yet. I am working on Fixer's monthly haul at the moment. 4,000 ISBNs to go... Ahasuerus 20:53, 20 April 2017 (UTC)
Ouch! That just underscores the need to farm out the manual editing and submission process to ISFDB editors. Good luck (with this month's anyway). Uzume 01:20, 21 April 2017 (UTC)
151 has been zapped. Ahasuerus 03:44, 21 April 2017 (UTC)

WSGI

Moved from Ahasuerus's Talk page.

BTW, I noticed Schema:authors says we are using author_note over note_id. Would you care to explain the reason/history on that? Thanks, Uzume 18:54, 17 April 2017 (UTC)

The reason why all notes (and title synopses) were originally put in the "notes" table had to do with the origins of the ISFDB project. ISFDB 1.0 as created in 1995 was a hand-crafted C-based database. When the software was rewritten in 2004-2006 using Python and MySQL, some implementation details were carried over even though they were no longer needed. For example, the publisher merge code increments and keeps track of "MaxRecords" instead of using natively available Python functionality. Over the last 8 years I have rewritten much of the code, but some vestiges remain. Ahasuerus 19:24, 17 April 2017 (UTC)
That type of global code only works because we are using CGI and the entire application state dies at the end of every HTTP transaction. I do not mind the application being deployed via CGI, but one thing I wanted to do was make it deployable in other ways too. Python has a specified API for that called WSGI, and I would like to move towards using it. That said, I wonder whether it would be easier to rewrite the application that way, starting with just record display and working our way up to full edit and moderator capability. It could be deployed side by side (keeping the same DB back end) in a beta sort of way until it was very stable, and we could then retire the old one. Uzume 01:24, 19 April 2017 (UTC)
I remember you mentioning WSGI at one point, but I am not really familiar with the technology. What kind of ROI are we looking at here (both on the R side and on the I side)? Ahasuerus 01:48, 19 April 2017 (UTC)
Well, it makes the application considerably more portable: if it were WSGI, we could deploy it in non-CGI settings. CGI has poor performance compared to other deployment technologies because it has to create a new OS process for every HTTP access. We also have security issues with the way we have used CGI: there are many CGI scripts (basically one per URL, which makes for a large attack surface to maintain, often with redundant security code), and the few libraries are copied around into CGI space (they do not need to be there at all; they just need to be placed into one of the paths the Python interpreter searches). That is the R. The I side needs some assessment. Python 2.5 was the first version to adopt WSGI via PEP 333 and includes wsgiref, a reference library, so we have that (even though later versions might be more refined). We can convert the application into a WSGI Python library module and have CGI stubs that call into it to maintain CGI-based deployment capability. There will likely be some other issues, though, like global code that depends on the state of the application being recreated upon each HTTP access (e.g., I believe you pointed this out in the publisher merge code above). Uzume 05:12, 19 April 2017 (UTC)
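To make the interface concrete, here is a minimal sketch of a WSGI callable (PEP 333) exercised with the stdlib wsgiref helpers. The app itself is illustrative, not ISFDB code.

```python
# A WSGI app is just a callable: it receives the request environment and a
# start_response callback, and returns an iterable of byte strings. The same
# callable can then be served by CGI stubs, mod_wsgi, gunicorn, etc.
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from WSGI"]

# Exercise the app without a real server, using wsgiref's test helper
# to build a plausible request environment:
environ = {}
setup_testing_defaults(environ)
captured = {}
def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers

body = b"".join(application(environ, start_response))
print(captured["status"], body.decode())  # -> 200 OK Hello from WSGI
```

Because the callable is decoupled from the process model, the same module can back both the existing one-process-per-request CGI deployment and a persistent-process server.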
Actually, the reference to the publisher merge code was just an example of our original Python code written the way you would write C code, which makes it hard to read and fragile.
As far as the benefits of WSGI go, they sound nice. However, there are many different ways to improve/reorganize the way the application is structured. Before we undertake a project of that magnitude, we'll need to compare alternative approaches and decide which (if any) are worth the effort. For now, especially given the number of features that are badly needed, my approach is "if it ain't broke, don't fix it." Ahasuerus 18:05, 20 April 2017 (UTC)
I do understand. But part of the issue is that the feature set is hard to add to and maintain because of the current (lack of) infrastructure. A cleanup could expedite development and thus possibly remedy the badly needed parts as well. A two pronged approach could be useful. Keep working on the current stuff and perhaps on the side also develop a possible replacement/uplift. Uzume 20:29, 20 April 2017 (UTC)
There comes a time in the life cycle of any system -- be it a module, an application, a car or a water heater -- when the maintenance costs rise to the point where it needs to be overhauled or replaced. It happened with ISFDB 1.0 in the early 2000s when it was replaced with ISFDB 2.0. It has happened with a number of ISFDB 2.0 modules, most recently the Advanced Search module, which I overhauled a few days ago. Eventually it will happen to ISFDB 2.0 as a whole. However, I don't think we are anywhere close to that point. I won't start spending any bandwidth on it until we are. Ahasuerus 02:06, 21 April 2017 (UTC)
I was not saying you should yet; however, it is often hard to see the optimal point until something comes along that underscores the issues. What I was suggesting is that perhaps someone else (perhaps me, should I have the time) take point on working on such a replacement, and we can work together, bouncing ideas off one another, to make that work better and/or backport some of the ideas here until the world gets better one way or another :) Uzume 04:30, 21 April 2017 (UTC)
Alas, "bouncing ideas" = "bandwidth". Ahasuerus 14:32, 24 April 2017 (UTC)

Another possible Fixer future

As a totally wild thought, you could build your own web application/website that serves as an interface to Fixer, allowing users (you could reuse ISFDB user credentials or not, as you see fit) to look at and fix the collected Fixer data while submitting entries to ISFDB. For the approval and editing part, it would provide a similar (but web-based) interface to the one you use. For other items (like code and direct DB changes, etc.) it would of course be more limited for security reasons. Anyway, it would allow people other than yourself to work the Fixer queues and would combine the advantages of automated data collection and manual editing (and of course manual moderation at ISFDB). I am not sure of everything Fixer does (I haven't looked at its code much) Uzume 23:58, 20 April 2017 (UTC)

Fixer's code is not public. It only runs on the development server and cannot be deployed to the production or any other server for various reasons. Ahasuerus 00:18, 21 April 2017 (UTC)
I assumed as much. I was just suggesting a Fixer web interface application somewhere (even if the development server is not publicly accessible). The data could be pushed to a publicly accessibly place (perhaps even the main ISFDB server). The point is to separate the processes of Fixer data collection and the ISFDB data submission (and farming this part out to ISFDB editors somehow). Uzume 01:16, 21 April 2017 (UTC)
I think there are five main steps in the process:
  1. Data collection
  2. Preliminary data cleanup
  3. Final data cleanup
  4. Submission creation
  5. Submission approval
Steps 1 and 2 are performed by Fixer. Steps 3 and 4 have been performed by me up until now, but Annie and I have been experimenting with User:Fixer/Public, a new public process. Once she comes back from Europe, we can finalize the process, at which point more editors may jump in. Hopefully. Ahasuerus 01:51, 21 April 2017 (UTC)
Right. I somehow doubt using a wiki page as a queue is an optimal process, however, it might be better in light of reducing your load. Uzume 04:22, 21 April 2017 (UTC)

but I know one of its main jobs is new pub submissions. It should be possible to add a way to prepopulate fields in edit/newpub and edit/addpub (we already do this for the edit/newpub pub type with the construct "edit/newpub.cgi?Novel", but more could be added), and those features could be used by users of the Fixer interface application to push Fixer's collected and templated data into ISFDB's newpub and addpub forms for manual editing and submission (vs. only you doing the manual editing work and Fixer directly queuing submissions via the rest/submission web API). For example, the proposed new Fixer web interface could create links like http://www.isfdb.org/cgi-bin/edit/newpub.cgi?pub_ctype=Novel&pub_year=2017-04-10&pub_publisher=What%20Books%20Press&.... If the arguments became too long, you could use POST instead of a GET query string. Other possible functions of Fixer could potentially be handled similarly (e.g., edit/editpub could prepopulate from the database based on current data and then override with Fixer-provided form data, allowing for proposed edits, etc.). The moderator note could be prepopulated with Fixer data, letting the moderator know that the origin of at least some of the data was Fixer. Users are good at editing ISFDB, but it is hard to scour the world of publishers continuously to find all the latest publication data reliably; that is one of Fixer's main strengths. What I am proposing lets you have both: it would leave you free to work on the ISFDB code and the Fixer code (to keep it in top shape collecting data) while having others use the data Fixer collects to create ISFDB submissions. Uzume 23:58, 20 April 2017 (UTC)
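Building such a prepopulated link can be sketched with the stdlib. The parameter names (pub_ctype, pub_year, pub_publisher) come from the example URL above; whether newpub.cgi would accept all of them is an assumption of this sketch.

```python
# Sketch of constructing a prepopulated edit/newpub link of the kind
# proposed above. Note that urlencode() encodes spaces as "+" rather than
# "%20"; CGI form parsing accepts both.
from urllib.parse import urlencode

base = "http://www.isfdb.org/cgi-bin/edit/newpub.cgi"
params = {
    "pub_ctype": "Novel",
    "pub_year": "2017-04-10",
    "pub_publisher": "What Books Press",
}
url = base + "?" + urlencode(params)
print(url)
```

A Fixer front end could emit one such link per harvested ISBN, handing the half-filled form to a human editor for cleanup and submission.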

Nightly reports - live data

Here is the data from the production server as of 2017-04-24. The total elapsed time was 11 minutes. The threshold was 2 seconds:

SVG files (10 reports) took 53.32 seconds to compile
Summary stats took 2.23 seconds to compile
Contributor stats took 34.31 seconds to compile
Verifier stats took 10.41 seconds to compile
Moderator stats took 3.84 seconds to compile
Authors by debut date took 33.73 seconds to compile
1 took 23.85 seconds to compile
2 took 17.10 seconds to compile
3 took 16.83 seconds to compile
8 took 3.04 seconds to compile
14 took 2.10 seconds to compile
15 took 2.95 seconds to compile
20 took 2.83 seconds to compile
32 took 7.84 seconds to compile
33 took 30.65 seconds to compile
34 took 11.85 seconds to compile
38 took 2.44 seconds to compile
40 took 2.93 seconds to compile
41 took 2.41 seconds to compile
42 took 3.00 seconds to compile
45 took 4.15 seconds to compile
47 took 35.82 seconds to compile
48 took 3.79 seconds to compile
49 took 2.00 seconds to compile
52 took 27.75 seconds to compile
54 took 8.67 seconds to compile
58 took 2.04 seconds to compile
59 took 2.38 seconds to compile
60 took 2.85 seconds to compile
61 took 2.04 seconds to compile
63 took 2.79 seconds to compile
80 took 20.97 seconds to compile
87 took 4.43 seconds to compile
88 took 5.88 seconds to compile
92 took 2.05 seconds to compile
93 took 13.67 seconds to compile
94 took 2.66 seconds to compile
95 took 2.50 seconds to compile
107 took 2.37 seconds to compile
111 took 7.90 seconds to compile
127 took 3.95 seconds to compile
137 took 4.95 seconds to compile
143 took 2.10 seconds to compile
151 took 65.12 seconds to compile (already optimized)
167 took 2.96 seconds to compile
168 took 8.80 seconds to compile
169 took 2.19 seconds to compile
177 took 2.01 seconds to compile
182 took 2.69 seconds to compile
188 took 17.88 seconds to compile
191 took 3.63 seconds to compile
193 took 19.05 seconds to compile
196 took 2.81 seconds to compile
197 took 2.83 seconds to compile
204 took 3.41 seconds to compile


Code style: tabs vs spaces

Development#Code_Format asserts that "The code appears to use 'TAB' instead of 'SPACE SPACE SPACE SPACE' to indent the code." However, it seems that some files use spaces, e.g. biblio/seriesgrid.py, to name a file I picked by chance, which may well be atypical, although the SVN history indicates it's been that way since at least 2017.

Also, if tabs are indeed the official style for indentation, what is the preferred value for tab stops, 4 chars, 8 chars or something else? (The aforementioned seriesgrid.py uses 8 space indentation, FWIW.)

I'm absolutely not trying to start a tabs vs spaces war, just seeking clarification. Whilst I definitely have my personal preference (spaces), I concur with the linked text that mixing and matching is the worst of all worlds.

(I should probably have asked this question before I spent an hour fighting my emacs setup to get it to automatically select tabs or spaces depending on what project directory the file being edited was in...) ErsatzCulture 18:28, 2 October 2019 (EDT)

It pains me to say it, but there is no real standard. Older modules tended to use tabs, newer modules tend to use 8 spaces. Some, like biblio/pe.py, use both, which, as you said, is the worst of both worlds. A few modules use 4 spaces. For what it's worth, my IDLE editor is set to use 8 spaces.
As a point of reference, back when the code was first made public (ca. 2008) we used CVS. The code was migrated to SVN in 2017 and that's what we have been using for the last 2 years. Pre-2017 history remains in a read-only CVS repository. I haven't touched CVS since shortly after the migration, so hopefully we won't need it ever again. Ahasuerus 18:56, 2 October 2019 (EDT)
Is there any reason the project can't be "officially" switched to using 8-spaces as a standard? (I don't know how many active devs there are who might have an opinion, the SVN history isn't very revealing in that regard :-)
FWIW, my personal emacs setup is that if I change a Python file, a save-hook also goes through and normalizes any tabs in the file to spaces. (Not sure if it goes in the opposite direction, I've not worked on any Python projects that standardized on tabs.) This seemed to be a reasonable compromise between having someone go through all files in a project and standardize them, vs leaving existing non-standard code unchanged, which leads to the sort of "anything goes" situation you described here. ErsatzCulture 19:30, 2 October 2019 (EDT)
Sure, we can make 8 spaces "official". I'll update the Development page shortly. We can also create a script to look for modules which use a mix of tabs and spaces if we want to go after the "low-hanging fruit".
Re: the number of active developers, things have changed over time. During the "ISFDB 1.0" era (C and a custom database, 1995-2005) and the first few years of the "ISFDB 2.0" era (Python/MySQL, 2006-), Al von Ruff was our only developer. In 2009 the code was made available on SourceForge, Al took a step back and I became the project administrator. From that point on, the number of contributing developers fluctuated as people came and went -- see Development/Archive.
I have been responsible for the majority of the changes since I am retired and can work on the project full time. Unfortunately, my health hasn't been that great lately, so things have slowed down quite a bit. My current big project is rewriting our main (closed source, not Python/MySQL) data acquisition robot to make it a part of the core software. That way it will still be maintainable even if I become unavailable. Ahasuerus 10:48, 3 October 2019 (EDT)
Development#Code_Format has been updated/cleaned up. Hopefully the new language and format make sense. Ahasuerus 11:13, 3 October 2019 (EDT)
Thanks. FWIW, I just hacked up a quick audit script that trawls through the .py files in the repo, and counts what indentation each file uses. Results were:
   isfdb-code-svn $ ./audit_tabs_and_spaces.py 
   [('Both tabs and spaces', 419), ('Spaces only', 74), ('Tabs only', 28), ('No indentation', 13)]
(This only looked at the first character of each line, hopefully there aren't any cases of tabs and spaces being used for indentation on the same line, except maybe for alignment in multi-line code, which is OK.)
At some point I might actually get round to making a meaningful contribution to this code base rather than merely nitpicking... ErsatzCulture 16:35, 3 October 2019 (EDT)
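A hypothetical reconstruction of the audit described above: classify each .py file by the first character of its indented lines. The real audit_tabs_and_spaces.py was not posted, so the function names and details here are guesses.

```python
# Walk a source tree and bucket each .py file into one of four indentation
# categories, matching the summary printed in the discussion above.
import os
from collections import Counter

def classify(path):
    # Only the first character of each line is inspected, as in the
    # original audit; mixed tab+space on one line is not detected.
    tabs = spaces = False
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"\t"):
                tabs = True
            elif line.startswith(b" "):
                spaces = True
    if tabs and spaces:
        return "Both tabs and spaces"
    if tabs:
        return "Tabs only"
    if spaces:
        return "Spaces only"
    return "No indentation"

def audit(root):
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".py"):
                counts[classify(os.path.join(dirpath, name))] += 1
    return counts
```

Running audit() over a checkout and printing the sorted Counter items would reproduce output of the shape shown above.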
Hey, more information is always good to have even when it's depressing :-) Thanks for digging! Ahasuerus 18:39, 3 October 2019 (EDT)
I would much prefer four spaces over eight; this is in line with PEP 8 too. PEP 8 does state we can keep tabs for old code, and our old doc (as copied above) certainly implies four spaces (per tab). Why the move to eight? Uzume 18:54, 3 January 2020 (EST)
Something I just found out, Python 3 refuses to run code that mixes tabs and spaces:
   common $ python3
   Python 3.6.5 (default, Apr  4 2018, 15:09:05) 
   ... elided ...
   >>> from isbn import isbnVariations
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/3rdparty/isfdb-code-svn/common/isbn.py", line 12, in <module>
     from library import validISBN, toISBN10, toISBN13
     File "/home/3rdparty/isfdb-code-svn/common/library.py", line 132
     if year == '0000':
                    ^
   TabError: inconsistent use of tabs and spaces in indentation
For a personal-but-ISFDB-related project written in Python 3, I wanted something to handle ISBN10<->ISBN13 conversion, and thought this module might be able to help. Obviously any upgrade to Python 3 of the real ISFDB code is a long way off - if ever? - so this is pretty academic, but I thought it worth mentioning. I thought it might be the sort of thing that there'd be an argument/env. var. you could pass to the Python interpreter to ignore this error, but none of the Stack Overflow answers around this error mentioned any such thing.
I might submit a patch to make the spaces/tabs consistent in this file, if it looks like that's the easiest way to solve my problem, although it's not a super-high-priority issue for me right now. ErsatzCulture 11:09, 25 March 2020 (EDT)
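For the ISBN10<->ISBN13 use case mentioned above, the conversion is small enough to sketch standalone in Python 3, sidestepping the TabError. The function name here is hypothetical; the real code base uses toISBN13 in common/library.py.

```python
# Convert a valid 10-character ISBN-10 to its ISBN-13 form: prefix "978",
# drop the old check digit, and recompute the check digit with the ISBN-13
# scheme (alternating weights 1 and 3 over the first 12 digits).
def to_isbn13(isbn10):
    core = "978" + isbn10[:9]            # drop the ISBN-10 check digit
    total = sum((1 if i % 2 == 0 else 3) * int(d)
                for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

print(to_isbn13("0306406152"))  # -> 9780306406157
```

Note this handles only the conversion direction discussed; validating the input ISBN-10 (including an 'X' check digit) is left out of the sketch.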