User:Alvonruff/The Charset Problem

From ISFDB

XXX

  • Browser - The browser follows the charset given in the HTTP Content-Type header, as well as in the HTML <meta> tag. This definitely affects how the text is displayed; changing it was one of the first hacks attempted on isfdb2.
  • Apache - Apache now has a configurable charset. This defaults to utf-8, based on this entry in the config file: AddDefaultCharset UTF-8
  • ISFDB Scripts - Whatever is stored in the UNICODE variable in localdefs.py, which is currently ISO-8859-1 (latin1)
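  • Python2.7 - str values are plain byte strings with no charset attached; the default encoding for implicit str/unicode conversion is ASCII (sys.getdefaultencoding()), not UTF-8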
  • MySQLdb - ??
  • MySQL - Set to latin1
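
The Python entry in the list above can be checked directly rather than assumed. The following is a minimal sketch, written to run unchanged on Python 2.7 and Python 3; the helper name python_encodings is hypothetical and not part of the ISFDB code.

```python
import locale
import sys


def python_encodings():
    """Collect the encodings the Python interpreter itself reports."""
    return {
        # Used for implicit str/unicode conversions in Python 2.
        "default": sys.getdefaultencoding(),
        # Used when file names are converted to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Used by print for unicode values; may be None when output is piped.
        "stdout": getattr(sys.stdout, "encoding", None),
        # What the operating system locale suggests for text data.
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(python_encodings().items()):
    print("%-10s %s" % (name, value))
```

Running this under the same account and environment as the web server is more informative than running it from an interactive shell, since the locale settings may differ between the two.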

Python2.7

If we write a short test program as follows:

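	import sys
	import MySQLdb
	# DBASEHOST, USERNAME, PASSWORD, DBASE, IsfdbConvSetup() and SQLgetAuthorName()
	# are defined elsewhere in the ISFDB code (imports not shown here).

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup())
	db.select_db(DBASE)

	# Look up the author whose ID was given on the command line and print the name
	authorID = int(sys.argv[1])
	authorName = SQLgetAuthorName(authorID)
	print authorName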

Then we get a result that, at first glance, looks correct. For instance, with an authorID of 26, this prints Philip José Farmer. However, if I add the following code:

	print len(authorName)
	print type(authorName)

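It outputs 19 for the length, although the name is only 18 characters long, and reports the type as str. A Python 2 str is a byte string, and the é takes two bytes in UTF-8, which accounts for the extra unit: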

	19
	<type 'str'>

If we decode the string as UTF-8, using: unicodeValue = value.decode("utf-8", "strict"), and print unicodeValue instead, then the same output would be:

	Philip José Farmer
	18
	<type 'unicode'>

If we instead decode the string as Latin-1, using: unicodeValue = value.decode("latin-1", "strict"), every byte maps to exactly one character, so the length remains 19 and the output would be:

	Philip José Farmer
	19
	<type 'unicode'>
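
The lengths reported above can be reproduced without a database. This is a minimal sketch that assumes the 19-byte value returned to the script is the UTF-8 encoding of the name; the variable names are illustrative only. It runs on both Python 2.7 and Python 3.

```python
# The name as text: 18 characters, with the accented letter written as an escape.
name = u"Philip Jos\u00e9 Farmer"

# Assumption: these are the bytes the script receives (UTF-8, so the
# accented letter occupies two bytes, 0xC3 0xA9).
raw = name.encode("utf-8")

as_utf8 = raw.decode("utf-8", "strict")      # two bytes -> one character
as_latin1 = raw.decode("latin-1", "strict")  # each byte -> one character

print(len(raw))        # 19
print(len(as_utf8))    # 18
print(len(as_latin1))  # 19

# Decoding as UTF-8 recovers the original text.
print(as_utf8 == name)  # True
# Decoding as Latin-1 yields U+00C3 and U+00A9 in place of the accented letter.
print(as_latin1 == u"Philip Jos\u00c3\u00a9 Farmer")  # True
```

A length of 19 after decoding is therefore a sign that the bytes were interpreted with the wrong charset, even when the printed text happens to look right on a particular terminal.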

MySQLdb

The Connection() function takes two optional arguments, use_unicode and charset (these only work with MySQL 4.1 and newer).

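import MySQLdb

# charset sets the character set of the connection;
# use_unicode=True makes text columns come back as unicode objects.
conn = MySQLdb.connect(host='127.0.0.1',
                       user='user',
                       passwd='passwd',
                       db='db',
                       charset='utf8',
                       use_unicode=True)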

These arguments are not used on ISFDB1, so we will also bypass these for now on ISFDB2.

MySQL

The current character set of the ISFDB MySQL database is latin1 (ISO-8859-1):

mysql> select default_character_set_name, default_collation_name from information_schema.schemata where schema_name='isfdb';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| latin1                     | latin1_swedish_ci      |
+----------------------------+------------------------+

That said, there are other MySQL charset variables to look at. On ISFDB1, we have:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

While on ISFDB2, MySQL defaulted these variables to:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8mb3                    |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
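
The practical difference between the two tables is in character_set_results: the server converts column values to that character set before sending them to the client. The sketch below mimics that conversion in plain Python (2.7 or 3). It assumes the column stores the name as Latin-1 bytes, and it uses Python's latin-1 and utf-8 codecs as stand-ins for MySQL's latin1 and utf8mb4, which is adequate for this name; the function name result_bytes is hypothetical.

```python
def result_bytes(stored, column_charset, results_charset):
    """Mimic the server converting a column value to character_set_results."""
    return stored.decode(column_charset).encode(results_charset)


# Assumption: the table holds the accented letter as the single byte 0xE9.
stored = b"Philip Jos\xe9 Farmer"

# ISFDB1: character_set_results = latin1, so the bytes pass through unchanged.
isfdb1 = result_bytes(stored, "latin-1", "latin-1")
# ISFDB2: character_set_results = utf8mb4, so 0xE9 becomes 0xC3 0xA9.
isfdb2 = result_bytes(stored, "latin-1", "utf-8")

print(len(isfdb1))  # 18
print(len(isfdb2))  # 19
```

Under these assumptions, the same row yields an 18-byte string on ISFDB1 and a 19-byte string on ISFDB2, which matches the length seen in the Python2.7 test.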

These variables can be set from the mysql command-line client by issuing the following commands:

  • set character_set_results = 'latin1';
  • set character_set_server = 'latin1';
  • set character_set_client = 'latin1';
  • set character_set_connection = 'latin1';

character_set_system is a read-only variable and cannot be changed at runtime. Changing the four variables above had no observable effect on the issue.
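
One thing to keep in mind is that SET without GLOBAL changes only the session in which it is issued, so commands typed into the mysql client do not carry over to connections opened by the Python scripts. The sketch below issues the same four statements through a DB-API cursor instead. Whether doing so changes the behavior described on this page has not been tested here; apply_session_charset and RecordingCursor are hypothetical names, and the recording cursor merely stands in for a real MySQLdb cursor so the example runs without a database.

```python
SESSION_VARIABLES = (
    "character_set_results",
    "character_set_server",
    "character_set_client",
    "character_set_connection",
)


def apply_session_charset(cursor, charset="latin1"):
    """Issue the four SET statements on the given cursor's own session."""
    # Charset names are plain alphanumerics; reject anything else so the
    # value can be placed in the statement text safely.
    if not charset.isalnum():
        raise ValueError("unexpected charset name: %r" % (charset,))
    statements = ["SET %s = '%s'" % (name, charset) for name in SESSION_VARIABLES]
    for statement in statements:
        cursor.execute(statement)
    return statements


class RecordingCursor(object):
    """Stand-in for a database cursor; it only records what it is given."""

    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)


cursor = RecordingCursor()
apply_session_charset(cursor)
print("\n".join(cursor.executed))
```

With a real connection, the cursor would come from db.cursor() on the same connection the scripts use for their queries.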