User:Alvonruff/The Charset Problem

From ISFDB
Jump to navigation Jump to search

For now, this is just a place to collect notes.


Overall Data Types


                   latin-1                 str                    str                 ascii                ???
        |  MySQL ------------> MySQLdb ------------> Python2.7 ------------> ISFDB ------------> Apache ------------> Browser
        |
ISFDB1  |
        |            ???                   ???                    ???                 ???                      ???                  ???
        |  MySQL <------------ MySQLdb <------------ Python2.7 <------------ ISFDB <------------ ISFDB/XML <------------ Apache <------------ Browser





                   latin-1                 str                    str                 str                 utf-8
        |  MySQL ------------> MySQLdb ------------> Python2.7 ------------> ISFDB ------------> Apache ------------> Browser
        |
ISFDB2  |
        |            ???                   ???                    ???                 ???                      ???                  ???
        |  MySQL <------------ MySQLdb <------------ Python2.7 <------------ ISFDB <------------ ISFDB/XML <------------ Apache <------------ Browser


Apache

Newer versions of Apache have the following default configuration parameter:

AddDefaultCharset UTF-8

XML

<?xml version="1.0" encoding="UTF-8"?>

ISBDB

There are two charset configs in common/isfdb.py:

There is an HTML session header (not observable in the HTML header):

print 'Content-type: text/html; charset=%s\n' % UNICODE

Additionally a <meta> tag is issued in the HTML flow:

print '<meta http-equiv="content-type" content="text/html; charset=%s" >' % UNICODE

The variable UNICODE is set to "ISO-8859-1" on isfdb1. Setting it to "UTF-8" fixes the output problem on isfdb2, but not the editing problem.

Python2.7

If we write a short test program as follows:

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup())
	db.select_db(DBASE)

	authorID = int(sys.argv[1])
	authorName = SQLgetAuthorName(authorID)
	print authorName

Then we get a result, that at first glance, looks like the correct output. So for instance, if I use an authorID of 26, this prints out Philip José Farmer. However, if I add the following code:

	print len(authorName)
	print type(authorName)

It outputs 19 for the length, while the actual length of his name is 18, while outputs a type of str:

	19
	<type 'str'>

If we convert the string to UTF-8, using: unicodeValue = value.decode("utf-8", "strict"), then the same output would be:

	Philip José Farmer
	18
	<type 'unicode'>

If we convert the string to LATIN-1, using: unicodeValue = value.decode("latin-1", "strict"), then the output would be:

	Philip José Farmer
	19
	<type 'unicode'>


MySQLdb

The Connection() function takes an optional arguments named use_unicode, and charset (these only work on MySQL-4.1 and newer).

conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',
                     use_unicode=True)

The current implementation on ISFDB1 does not set use_unicode nor charset. The following test program:

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup())
	db.select_db(DBASE)

	authorID = int(sys.argv[1])
	authorName = SQLgetAuthorName(authorID)
	print authorName
	print len(authorName)
	print type(authorName)

generates:

Philip José Farmer
19
<type 'str'>

While altering the MySQLdb.connect line to:

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup(), use_unicode=True, charset="latin1")

generates:

Philip José Farmer
18
<type 'unicode'>

So data comes into MySQLdb as latin1, and everything coming into the isfdb scripts from MySQLdb is currently of type str.

MySQL

The current ISFDB character set of the MySQL database is latin1 (ISO-8859-1):

mysql> select default_character_set_name, default_collation_name from information_schema.schemata where schema_name='isfdb';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| latin1                     | latin1_swedish_ci      |
+----------------------------+------------------------+

That said, there are other MySQL charset variables to look at. On ISFDB1, we have:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

While on ISFDB2, MySQL defaulted these variables to:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8mb3                    |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

These variables can be set using the mysql app by issuing the following commands:

  • set character_set_results = 'latin1';
  • set character_set_server = 'latin1';
  • set character_set_client = 'latin1';
  • set character_set_connection = 'latin1';

character_set_system is a read-only variable and cannot be changed at runtime. Changing the four above variables had no observable effect on the issue.