User:Alvonruff/The Charset Problem

From ISFDB
Jump to navigation Jump to search

For now, this is just a place to collect notes. So far the changes made to isfdb2 are:

  • Set the mysql charset variables from utf8mb4 to latin1
  • Split isfdb UNICODE variable into UNICODE_IN and UNICODE_OUT
  • Set UNICODE_OUT to utf-8
  • Set UNICODE_IN to iso-8859-1


Overall Data Types


                     latin-1                 str                    str                 ascii              iso-5589-1
          |  MySQL ------------> MySQLdb ------------> Python2.7 ------------> ISFDB ------------> Apache ------------> Browser
          |
ISFDB1    |
          |          latin-1                 str                     ???               iso-5589-1              iso-5589-1           iso-5589-1
          |  MySQL <------------ MySQLdb <------------ Python2.7 <------------ ISFDB <------------ ISFDB/XML <------------ Apache <------------ Browser





                     utf-8                   str                    str                 ascii             iso-55891-1
          |  MySQL ------------> MySQLdb ------------> Python2.7 ------------> ISFDB ------------> Apache ------------> Browser
          |
ISFDB2    |
(default) |          utf-8                   str                    str                iso-5589-1                ???                 ???
          |  MySQL <------------ MySQLdb <------------ Python2.7 <------------ ISFDB <------------ ISFDB/XML <------------ Apache <------------ Browser



                     latin-1                 str                    str                 ascii               utf-8
          |  MySQL ------------> MySQLdb ------------> Python2.7 ------------> ISFDB ------------> Apache ------------> Browser
          |
ISFDB2    |
(fixed)   |          latin-1                 str                    str                iso-5589-1               ???                 iso-5589-1
          |  MySQL <------------ MySQLdb <------------ Python2.7 <------------ ISFDB <------------ ISFDB/XML <------------ Apache <------------ Browser

Validation of input data, Philip José Farmer (targeting the accented character):

  • Browser forms are the same from isfdb1 and isfdb2
  • The generated form data (validated within Chrome) are the same from isfdb1 and isfdb2. Both generate character 0xE9.
  • The generated XML data (prior to injection into MySQL) appears like this: "....&#1072; - Jos\é Farmer</Title>"
    • The backslash is introduced in EvalField(), with a call to db.escape_string(), which is a MySQLdb function.
    • escape_string() isjust a python wrapper around mysql_escape_string(), which is found in the _mysql.so library
  • The XML data present in MySQL is different between isfdb1 and isfdb2. On isfdb1 the character is 0xE9 (é) while on isfdb2 the character is 0x3F (?)

Browser

HTML POST forms can now specify the input character set:

<form accept-charset="character_set">
  • If we don't specify a character set, then we see the same issue as the output problem, with characters outside of latin1 being mangled.
  • If we specify utf-8, then those outside characters are fine, but current characters like Japanese and Cyrillic are mangled.
  • If we specify iso-5889-1, then the letters outside of latin1 are mangled.

Config Settings #1

If we set the following:

  • HTML form has default charset (same as isfdb1), AND
  • Apache httpd.conf has: AddDefaultCharset ISO-8859-1 (same as isfdb1), AND
  • The HTML header has: <meta charset="iso-8859-1"/> (same as isfdb1)

Then:

  • Cyrillic appears ok in the form

But, not able to mix:

  • Cyrillic AND German on the same line
  • Cyrillic AND non-Latin1 characters like 'José' on the same line

Config Settings #2

If we set the following:

  • HTML form has: ACCEPT-CHARSET="iso-5598-1" (not what isfdb1 has), AND
  • Apache httpd.conf has: AddDefaultCharset ISO-8859-1 (same as isfdb1), AND
  • The HTML header has: <meta charset="iso-8859-1"/> (same as isfdb1)

Then I able to input:

  • Cyrillic AND German on the same line

But, not able to mix:

  • Cyrillic AND non-Latin1 characters like 'José'

Config Settings #3

If we set the following:

  • HTML form has: ACCEPT-CHARSET="iso-5598-1" (not what isfdb1 has), AND
  • Apache httpd.conf has: AddDefaultCharset ISO-8859-1 (same as isfdb1), AND
  • The HTML header has: <meta charset="utf-8"/> (not what isfdb1 has)

Then not able to mix:

  • Cyrillic and German: German looks ok, but Cyrillic is replaced with ???????
  • Cyrillic AND non-Latin1: non-latin looks ok, but Cyrillic is replaced with ???????

Interestingly, the XML shows a well-formed <Subject> field, which is a copy of the book title, but the <Title> field is mangled.

Apache

Newer versions of Apache have the following default configuration parameter:

AddDefaultCharset UTF-8

Using the inspect facility in Chrome, we can see the form data coming back from the browser. If:

  • UNICODE_IN/OUT is set to iso-5589-1 (same as isfdb1)
  • Apache charset set to: AddDefaultCharset ISO-8859-1
  • and No charset specified in the HTML form statement,

then the form contents on isfdb1 and isfdb2 are identical. So the character differences are in the isfdb scripts.

XML

In the editing system, the XML encoding is picked up from the same UNICODE variable that controls the outgoing charset. When set to UTF-8, the xml headers then appear as:

<?xml version="1.0" encoding="UTF-8"?>

However, in isfdb1, UNICODE is set to ISO-8859-1, so they appear as:

<?xml version="1.0" encoding="iso-5589-1"?>

There are iso-5589-1 presumptions built into the isfdb editing system, so encoding with utf-8 causes encoding/decoding errors.

ISBDB

There are two charset configs in common/isfdb.py:

There is an HTML session header (not observable in the HTML header):

print 'Content-type: text/html; charset=%s\n' % UNICODE

Additionally a <meta> tag is issued in the HTML flow:

print '<meta http-equiv="content-type" content="text/html; charset=%s" >' % UNICODE

The variable UNICODE is set to "ISO-8859-1" on isfdb1. Setting it to "UTF-8" fixes the output problem on isfdb2, but not the editing problem.

Python2.7

If we write a short test program as follows:

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup())
	db.select_db(DBASE)

	authorID = int(sys.argv[1])
	authorName = SQLgetAuthorName(authorID)
	print authorName

Then we get a result, that at first glance, looks like the correct output. So for instance, if I use an authorID of 26, this prints out Philip José Farmer. However, if I add the following code:

	print len(authorName)
	print type(authorName)

It outputs 19 for the length, while the actual length of his name is 18, while outputs a type of str:

	19
	<type 'str'>

If we convert the string to UTF-8, using: unicodeValue = value.decode("utf-8", "strict"), then the same output would be:

	Philip José Farmer
	18
	<type 'unicode'>

If we convert the string to LATIN-1, using: unicodeValue = value.decode("latin-1", "strict"), then the output would be:

	Philip José Farmer
	19
	<type 'unicode'>


MySQLdb

The Connection() function takes an optional arguments named use_unicode, and charset (these only work on MySQL-4.1 and newer).

conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',
                     use_unicode=True)

The current implementation on ISFDB1 does not set use_unicode nor charset. The following test program:

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup())
	db.select_db(DBASE)

	authorID = int(sys.argv[1])
	authorName = SQLgetAuthorName(authorID)
	print authorName
	print len(authorName)
	print type(authorName)

generates:

Philip José Farmer
19
<type 'str'>

While altering the MySQLdb.connect line to:

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup(), use_unicode=True, charset="latin1")

generates:

Philip José Farmer
18
<type 'unicode'>

So data comes into MySQLdb as latin1, and everything coming into the isfdb scripts from MySQLdb is currently of type str.

MySQL

The current ISFDB character set of the MySQL database is latin1 (ISO-8859-1):

mysql> select default_character_set_name, default_collation_name from information_schema.schemata where schema_name='isfdb';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| latin1                     | latin1_swedish_ci      |
+----------------------------+------------------------+

That said, there are other MySQL charset variables to look at. On ISFDB1, we have:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

While on ISFDB2, MySQL defaulted these variables to:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8mb3                    |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

These variables can be set using the mysql app by issuing the following commands:

  • set character_set_results = 'latin1';
  • set character_set_server = 'latin1';
  • set character_set_client = 'latin1';
  • set character_set_connection = 'latin1';

character_set_system is a read-only variable and cannot be changed at runtime. Changing the four above variables had no observable effect on the issue.