Talk:Development
See Talk:Development/Archive for archived discussions
Trimming the Wiki
Our Wiki tables are getting fatter and the database is now over 4Gb in size with over 95% of it on the Wiki side. About a year ago, Al put together a script to drop all but the last 15 revisions of each Wiki page, which seemed to help. He tried running the script again earlier this spring when we had various performance and stability problems, but it didn't seem to do much.
The current version of the script is wikitrim.py. It is available in the "scripts" subdirectory on Sourceforge. Do we have brave (and Wiki-literate) volunteers who could review it and determine if it needs to be upgraded to work with our version of the Wiki software? Ahasuerus 03:41, 12 June 2009 (UTC)
- Have we changed Wiki software versions since then? I don't recall hearing about that, and there's nothing on "What's New". The script should still work as designed, but apart from the usual suspects most pages don't go through many revisions. Maybe all the new Image pages are distorting the size since we enabled uploading? BLongley 17:59, 12 June 2009 (UTC)
- Checking the full backup, I see that "mw_text" is 3.4Gb while our images (which are backed up separately) are only 580Mb. Ahasuerus 22:49, 13 June 2009 (UTC)
- If you look at Special:Mostrevisions you see that only 35 pages have more than 20 revisions, and #1 is Rules and standards discussions (2,171 revisions). #2 is Main Page (681 revisions). I'm not sure if this includes revisions already dropped but whose timestamps and other metadata are still retained. -DES Talk 23:12, 13 June 2009 (UTC)
- Strange that it doesn't list the Community Portal, which has accumulated more than 500 revisions... Ahasuerus 00:09, 14 June 2009 (UTC)
- It appears to ignore all "ISFDB:" wiki pages... I seem to recall a discussion that those pages aren't really part of a wiki namespace so several functions ignore those pages.... perhaps the cleanup scripts have been ignoring that page, and Moderator Noticeboard, etc. Kevin 00:48, 14 June 2009 (UTC)
- The Mostrevisions report appears to include changes whose text has been deleted, and changes for which even the change 'note' information has been deleted. Kevin 00:46, 14 June 2009 (UTC)
- What I can see is that it's keeping 20 revisions, not 15 as you say, so we could make it 15 and get a bit more benefit. Or specify different limits for different namespaces. I can't test any of this, though: I have no data locally, and with no local MediaWiki installation I can't create any properly either. BLongley 17:59, 12 June 2009 (UTC)
- It turns out that it's 20 in Sourceforge but 15 on the live server. I'll upload the current version tonight. Ahasuerus 22:56, 12 June 2009 (UTC)
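The idea of different retention limits per namespace can be sketched in Python. This is only an illustration, not the logic of wikitrim.py: the function name, the tuple layout and the sample data are all hypothetical, and it assumes that a larger revision id always means a newer revision.

```python
from collections import defaultdict


def revisions_to_trim(revisions, default_keep=15, keep_by_namespace=None):
    """Return the revision ids that fall outside the per-page retention limit.

    revisions: iterable of (namespace, page_id, rev_id) tuples (hypothetical layout).
    Assumes a larger rev_id means a newer revision.
    """
    keep_by_namespace = keep_by_namespace or {}
    by_page = defaultdict(list)
    for namespace, page_id, rev_id in revisions:
        by_page[(namespace, page_id)].append(rev_id)

    to_trim = []
    for (namespace, _page_id), rev_ids in by_page.items():
        keep = keep_by_namespace.get(namespace, default_keep)
        rev_ids.sort(reverse=True)      # newest first
        to_trim.extend(rev_ids[keep:])  # everything past the newest `keep`
    return sorted(to_trim)


# Made-up data: page 1 (namespace 0) has five revisions, page 2 (namespace 4) has three.
sample = [(0, 1, r) for r in range(100, 105)] + [(4, 2, r) for r in range(200, 203)]
trimmed = revisions_to_trim(sample, default_keep=2, keep_by_namespace={4: 1})
print(trimmed)
```

With a default of 2 and a limit of 1 for namespace 4, only the two newest revisions of page 1 and the newest revision of page 2 survive. Nothing is deleted here; the function only reports candidates.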
- One thing that might help is to know which namespaces are the biggest offenders. It would help if you could get some figures for total pages and total text revisions, e.g. run these queries against a full backup:
select page_namespace, count(*) from mw_page group by page_namespace order by 2 desc
- and
select old_namespace, count(*) from mw_text group by old_namespace order by 2 desc
- (untested for same reasons). Another thing to look at is to remove the revision history for the revisions whose text we're deleting. Is it any good to know that, say, Mike Christie changed something sometime if we can't see what changed? I think that means clearing mw_revision where we've already deleted the mw_text entry (in past runs), or where we are deleting it (in future runs of wikitrim.py). I could write that, but would really flag it with a big "I didn't test this, and am not sure how good an idea it is!" warning. Do we have anyone with a local MediaWiki installation working yet? BLongley 17:59, 12 June 2009 (UTC)
- I can take a look at it when I have the chance. I run a wiki and can do some analysis and trimming. --MartyD 01:40, 13 June 2009 (UTC)
- No need to write anything to clean up the revision history. MediaWiki has maintenance/deleteOrphanedRevisions.php:
php deleteOrphanedRevisions.php --report
- to show you what it's going to do, and omit --report to actually do it. That removes revision records not corresponding to any remaining version of any active or deleted page (deleted pages are "archived", not deleted).
- When I ran this report, I received the following error:
[root@isfdb maintenance]# php deleteOrphanedRevisions.php --report
Delete Orphaned Revisions
Checking for orphaned revisions...PHP Notice: Undefined variable: revisions in /var/www/html/wiki/maintenance/deleteOrphanedRevisions.php on line 35
found 0.
- I don't know much about PHP, so I am not sure whether the message was just informational or whether "found 0" was bogus. Ahasuerus 00:21, 14 June 2009 (UTC)
- Looks like an error of some sort. --MartyD 10:57, 14 June 2009 (UTC)
- There's also a purgeOldText.php that will remove mw_text entries for which there is no revision or archive:
php purgeOldText.php
- to show you what it's going to do, and add --purge to actually do it.
- The script dumped a lot of data on the screen and then:
from within function "".
MySQL returned error "1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' 113641, 113642, 113645, 113646, 113647, 113648, 113649, 113650, 113651, 113652,' at line 1 (localhost)"
- Ahasuerus 00:52, 14 June 2009 (UTC)
- Looks like it probably exceeds the number of entries MySQL allows in an "in" list. We may have to modify the script's query to include some sort of limit on how many each pass will attempt. --MartyD 10:57, 14 June 2009 (UTC)
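If the guess about an oversized "in" list is right, one way to add a limit is to split the ids into batches and issue one statement per batch. The Python sketch below only builds the statements and does not touch a database. The helper names and the batch size of 1000 are arbitrary assumptions, and this is not how purgeOldText.php itself is written.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_delete_statements(ids, size=1000, table="mw_text", column="old_id"):
    """Build one parameterized DELETE per batch of ids.

    Returns a list of (sql, params) pairs; an empty id list yields no statements,
    so an empty IN () clause is never generated.
    """
    statements = []
    for batch in chunked(list(ids), size):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = f"DELETE FROM {table} WHERE {column} IN ({placeholders})"
        statements.append((sql, batch))
    return statements


# Made-up ids, tiny batch size so the batching is visible.
statements = build_delete_statements(range(1, 6), size=2)
for sql, params in statements:
    print(sql, params)
```

Each (sql, params) pair could then be passed to a database cursor's execute method, one pass at a time, which also makes it easier to see which batch fails if the error turns out to have a different cause.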
- Once those are cleaned up, you can use http://www.isfdb.org/wiki/index.php/Special:Mostrevisions to see which pages have the most revisions currently. I think the reason there is a special trimming script is because MediaWiki's maintenance scripts do not provide a way to keep some history (the only script purges all old revisions). The script itself looks fine to me.
php deleteArchivedFiles.php
php deleteArchivedRevisions.php
- are two additional scripts (add --delete to actually do them) that will get rid of deleted files and pages (meaning, they can't be restored anymore). http://www.isfdb.org/wiki/index.php/Special:Log/delete shows you the deletion log. --MartyD 12:13, 13 June 2009 (UTC)
- Do we really need to keep the last 'x' revisions? For the worst offenders, keeping 20 edits only covers the last 1-7 days. (Shrug).
- Do we really need to trim the history that much? Storage is cheap these days -- hard drives with hundreds of Gb are standard. I do find revision history handy, and more so on the more popular pages. I typically look at recent edits as diffs, and often at older edits that way too. My inclination would be to keep all revisions, and failing that, as many as space reasonably allowed. -DES Talk 01:07, 14 June 2009 (UTC)
- At the moment, we are doing OK space-wise, but, unfortunately, there are a few problems with the current situation. First, the backup process slows everything down a lot and the database becomes almost unusable while it runs. The bigger the Wiki database gets, the longer the outage lasts. NorAm-based users may not notice it since it happens in the middle of the night here, but Asian/Oz/European users may not be so lucky.
- Second, we suspect that the size of the Wiki database affects our performance. We can't really prove it, but performance seemed to improve last year after Al purged the old versions.
- Third, it makes the process of copying backup files to my local server more time consuming and increases the risk of failed FTPs due to the amount of time that it takes. It also makes generating publicly posted backups more time consuming. It is still manageable, but it would be more productive for me to spend my time on testing and software deployment.
- Having said that, we already archive the Community Portal, the Rules/Standards page and our editors' Talk pages. Given their size and the number of revisions, it's possible that if we simply dropped all but the latest version of the "top offenders", we would not need to do anything else. Do we know if there is a script that lets you purge all old versions of just one page? Ahasuerus 01:20, 14 June 2009 (UTC)
- Yes.
php deleteOldRevisions.php page_id1 page_id2 ...
- Here you add --delete to delete them and use no page_id values to do it to everything. Unfortunately, I don't know an easy way to find out the page ids (it's the database id, not the title), except to do a query and look up the id for the title. There might be some MediaWiki extensions to make some of this easier. --MartyD 10:57, 14 June 2009 (UTC)
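The "do a query and look up the id for the title" step could look roughly like this in Python. It is a sketch under assumptions: an in-memory SQLite table stands in for the real MySQL mw_page table, the function name is hypothetical, and the single sample row mirrors the User:Isfdb test page and id mentioned in this thread. MediaWiki stores titles with underscores instead of spaces, and namespace 2 is the User namespace.

```python
import sqlite3


def find_page_id(conn, namespace, title):
    """Look up page_id for a title within a namespace; None if there is no such page."""
    db_title = title.replace(" ", "_")  # titles are stored with underscores
    row = conn.execute(
        "SELECT page_id FROM mw_page WHERE page_namespace = ? AND page_title = ?",
        (namespace, db_title),
    ).fetchone()
    return row[0] if row else None


# Stand-in for the live database: a minimal mw_page table with one row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mw_page (page_id INTEGER PRIMARY KEY, "
    "page_namespace INTEGER, page_title TEXT)"
)
conn.execute("INSERT INTO mw_page VALUES (5471, 2, 'Isfdb_test')")

page_id = find_page_id(conn, 2, "Isfdb test")
print(page_id)
```

Against the live server the same SELECT would be run through a MySQL client or driver, which uses a different placeholder style than SQLite's "?".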
- I created a new revision of User:Isfdb test, found its ID in mw_page, tested that it was indeed the right ID, and then ran "php deleteOldRevisions.php 5471" without "--delete". It found 1 active and 1 old revision, so everything looked fine and I ran it with "--delete". It produced what looked like an "OK, deleting" message, but then I got the same dump of thousands of numbers that I mentioned above and the same error:
2, 139825, 141742, 141805, 141806, 141807, 141808, 141833, 142087, 142089, 144519, 145243, 145245, 145246, 145247, 145249, 145469, 145824, 149307, 149304, 150272, 149907, 150255, 150544, 150780, 152759 )"
from within function "".
MySQL returned error "1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' 113641, 113642, 113645, 113646, 113647, 113648, 113649, 113650, 113651, 113652,' at line 1 (localhost)"
- The old revision was not deleted, but rerunning "php deleteOldRevisions.php 5471" without "--delete" produced the following output:
Delete Old Revisions
Limiting to `mw_page`.page_id IN (5471)
Searching for active revisions...done.
Searching for inactive revisions...done.
PHP Notice: Undefined variable: old in /var/www/html/wiki/maintenance/deleteOldRevisions.inc on line 50
0 old revisions found.
- I don't think I understand what's going on here and it's probably safer not to experiment on the live server any further until we have a better understanding of what's erroring out and how to fix it. Ahasuerus 23:43, 14 June 2009 (UTC)
- Yes, could be data, or could be problems in whatever version of scripts the site has. If you tar me up the maintenance directory on the server, I can compare them against the version of MediaWiki I have installed. That may give us some clues without much effort or risk. --MartyD 01:32, 15 June 2009 (UTC)
Undocumented XML NewPub submission format
I've found the following undocumented tag in XML:NewPub:
- ClonedTo - Used to export/import contents between publications. Contents will be added to the publication with id specified in ClonedTo.
It is used with the MOD_PUB_CLONE submission type. I guess it is not part of the official/stable XML API, but shouldn't it be listed somewhere on the Wiki? --Roglo 14:47, 27 June 2009 (UTC)
Short-Cut Process
For the trivial (affects only single modules at a time, usually) I've been using this:
1. A developer selects a bug to fix or a feature to implement.
2. The developer makes changes locally and tests the new behavior.
3. If it doesn't work, abandon!
4. Else: the developer commits the code in Sourceforge.
5. The developer updates the Bug/Feature on Sourceforge.
6. The developer posts on the "Outstanding changes" section of the development page that the change is ready for testing.
It's not perfect, but keeps me coding on the "low-hanging fruit" stuff. I know some people will have problems with this, but I would like to know what the problems for them actually are. For instance, some Feature Requests cover lots of modules that all need changing the same way, but as the requirements are general and there's nobody clarifying exactly what's affected we'll end up with features half-implemented. Which isn't a problem for me really, so long as we're improving things, but it does make Project Managers nervous. BLongley 21:55, 5 July 2009 (UTC)
- I have been thinking along similar lines lately and it looks like we do need to relax our process requirements a bit. "Approved" is a good thing to have for major changes that require design decisions like adding Roles, but minor features do not need a drawn out discussion. Perhaps we could change the process to something like "Post on the Community Portal that you are about to implement Feature Request 666 and see if anyone has objections/suggestions"? Ahasuerus 03:12, 7 July 2009 (UTC)
- "About to implement" might be a bit strong - after all, if it doesn't get past you, it is never actually implemented. Maybe we need a third section between "Outstanding changes" (likely to be implemented) and "On Hold" (is making someone nervous) and create a "should I even suggest this, as it looks dead simple to do, so I did it, but it might have been discussed and dismissed ages ago?" category (with a shorter name of course). BLongley 21:59, 10 July 2009 (UTC)
- I'm basically trying to not discourage people from trying things out, while not encouraging them to work on stuff that's unwanted. BLongley 21:59, 10 July 2009 (UTC)
- For instance, today I've looked at the 1980 printings of The Hitch Hiker's Guide to the Galaxy and thought "when you have identical pubs apart from price, the higher-priced might be better sorted after the lower-priced". I know that's not always true - a budget edition might share the same year of publication as the original - but in my experience popular books keep getting reprinted at ever higher prices, so a sort change might be appropriate. So I'd have a look at that. If it turned out to be simple, I'd find that out by doing it, then I'd create a Feature Request (if there isn't one already) and add the code for people to test. If it was complex but I found it worth looking at, I'd just create the Feature Request. If I saw the effect, and then changed my mind about sort order, I'd not submit it. BLongley
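To make the proposed tie-break concrete, here is a small Python sketch of sorting by date first and price second. It is not the ISFDB display code: the dictionary layout, the function names and the sample records are made up for illustration, and prices in different currencies are not really comparable, which this sketch ignores.

```python
import re


def price_value(price):
    """Extract the first number from a price string; unpriced pubs sort last."""
    match = re.search(r"\d+(?:\.\d+)?", price or "")
    return float(match.group()) if match else float("inf")


def sort_pubs(pubs):
    """Sort by date first, then use the price as the final deciding order."""
    return sorted(pubs, key=lambda pub: (pub["date"], price_value(pub["price"])))


# Made-up records: three pubs share a date and differ only in price.
pubs = [
    {"tag": "A", "date": "1980-00-00", "price": "$1.25"},
    {"tag": "B", "date": "1980-00-00", "price": "$0.95"},
    {"tag": "C", "date": "1979-00-00", "price": "$1.50"},
    {"tag": "D", "date": "1980-00-00", "price": ""},
]
ordered = [pub["tag"] for pub in sort_pubs(pubs)]
print(ordered)
```

Putting unpriced pubs last is just one possible choice; whether they belong first or last is the kind of detail a Feature Request discussion would settle.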
- Logging a Feature Request up front (well, once you've settled on the idea that you think it would be useful) is certainly the way to go, but I don't think code should be checked in, except perhaps on a branch, for things that haven't been agreed to unless there is little risk of controversy. --MartyD 10:04, 11 July 2009 (UTC)
- This is why I started with the word "trivial" which I hope is no worse than your "little risk of controversy". BLongley 21:37, 11 July 2009 (UTC)
- So to use your example, by all means log the request, but at that point, because it's almost certain that there will be conflicting opinions about sort order, a discussion should be posted somewhere. --MartyD 10:04, 11 July 2009 (UTC)
- I hope that adding some final deciding order to two otherwise identical pubs that would otherwise appear in random order (I think - haven't checked it yet) wouldn't be controversial. But I could be wrong, there's been some really strange arguments here at times. BLongley 21:37, 11 July 2009 (UTC)
- With any luck, there will be a clear consensus. If not, variations may need to be submitted for other developers/testers to try, but those would need to be submitted in a way so as not to interfere with other development (i.e., using branches). --MartyD 10:04, 11 July 2009 (UTC)
- I'd like to avoid branches; they add an area of complexity beyond most of our people's abilities, I think. (Over 50% of the developers I've worked with/against or managed find it easier to ignore branches and just make their version of the code the latest.) If it is really as trivial as I'm suggesting, though (a few lines changed in a single module), then it's no trouble to revert or change something that caused unexpected controversy. BLongley 21:37, 11 July 2009 (UTC)
- If that and examples/screen shots are insufficient, only then should an open and under-discussion proposal make it into the main line so that it can be deployed for non-developers/testers to get their hands on in order to offer better feedback. My guess is most things can be resolved without going that far. --MartyD 10:04, 11 July 2009 (UTC)
- Yes, if we restrict this to the trivial, non-controversial, doesn't-affect-any-other-known-development-efforts, it won't get that far. I'm talking about things so trivial that it would be an insult to people's intelligence to post a screen-shot before asking them to vote on whether such a change would have a negative effect. BLongley 21:37, 11 July 2009 (UTC)
- One thing we really need to avoid is submitting code changes that the community does not want used -- it will be difficult to make sure some or all of those pieces don't get incorporated accidentally into future work. And logging a Feature Request gives us a place to record the community's rejection of ideas as well, which will help us when the same idea comes up again in the future. --MartyD 10:04, 11 July 2009 (UTC)
- I too have found that even (seemingly) simple straightforward Feature Requests can elicit comments from other editors that make me look at the issue in a whole different light and tweak or even completely change the proposed feature.
- Due to the number of outstanding FRs, we may want to try to tackle them in "batches", e.g. list all Series-related FRs on the Community Portal and see if we can agree on the direction that we want to take this part of the database in. Once we have a consensus, we can mark all related FRs "Approve", link them back to the Wiki discussion and create any additional FRs that may be required. Ahasuerus 19:43, 11 July 2009 (UTC)
- "Batching" would indeed help developers see if anything they thought was non-controversial was being affected by other possible efforts. But should we hold everything up while we discuss the full impact of ISFDB:Moderator_noticeboard#Is_the_link_to_talk_page_for_new_submissions_working.3F or can I just adjust a few modules that nobody else is working on in the meantime? BLongley 21:51, 11 July 2009 (UTC)
- As it turned out (see Bill's Talk page), there are some issues with database-Wiki links that need to be addressed first. I'll start a new section below. Ahasuerus 22:25, 12 July 2009 (UTC)

