Difference between revisions of "User:Alvonruff/The Charset Problem"

From ISFDB
Jump to navigation Jump to search
Line 1: Line 1:
XXX
 
  
* Browser - The Browser just follows the html content-type indicator, as well as the <meta> tag. This definitely affects the appearance of the text, as this was one of the first hacks attempted at isfdb2.
 
* Apache - Apache now has a configurable charset. This defaults to utf-8, based on this entry in the config file: ''AddDefaultCharset UTF-8''
 
* ISFDB Scripts - Whatever is stored in the UNICODE variable in localdefs.py, which is currently ISO-8859-1 (latin1)
 
* Python2.7 - Defaults to UTF-8
 
* MySQLdb - ??
 
* MySQL - Set to latin1
 
  
 
==Apache==
 
==Apache==

Revision as of 09:22, 29 April 2022


Apache

Newer versions of Apache have the following default configuration parameter:

AddDefaultCharset UTF-8


ISBDB

There are two charset configs in common/isfdb.py:

There is an HTML session header (not observable in the HTML header):

print 'Content-type: text/html; charset=%s\n' % UNICODE

Additionally a <meta> tag is issued in the HTML flow:

print '<meta http-equiv="content-type" content="text/html; charset=%s" >' % UNICODE

The variable UNICODE is set to "ISO-8859-1" on isfdb1. Setting it to "UTF-8" fixes the output problem on isfdb2, but not the editing problem.


Python2.7

If we write a short test program as follows:

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup())
	db.select_db(DBASE)

	authorID = int(sys.argv[1])
	authorName = SQLgetAuthorName(authorID)
	print authorName

Then we get a result, that at first glance, looks like the correct output. So for instance, if I use an authorID of 26, this prints out Philip José Farmer. However, if I add the following code:

	print len(authorName)
	print type(authorName)

It outputs 19 for the length, while the actual length of his name is 18, while outputs a type of str:

	19
	<type 'str'>

If we convert the string to UTF-8, using: unicodeValue = value.decode("utf-8", "strict"), then the same output would be:

	Philip José Farmer
	18
	<type 'unicode'>

If we convert the string to LATIN-1, using: unicodeValue = value.decode("latin-1", "strict"), then the output would be:

	Philip José Farmer
	19
	<type 'unicode'>


MySQLdb

The Connection() function takes an optional arguments named use_unicode, and charset (these only work on MySQL-4.1 and newer).

conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',
                     use_unicode=True)

These arguments are not used on ISFDB1, so we will also bypass these for now on ISFDB2.


MySQL

The current ISFDB character set of the MySQL database is latin1 (ISO-8859-1):

mysql> select default_character_set_name, default_collation_name from information_schema.schemata where schema_name='isfdb';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| latin1                     | latin1_swedish_ci      |
+----------------------------+------------------------+

That said, there are other MySQL charset variables to look at. On ISFDB1, we have:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

While on ISFDB2, MySQL defaulted these variables to:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8mb3                    |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

These variables can be set using the mysql app by issuing the following commands:

  • set character_set_results = 'latin1';
  • set character_set_server = 'latin1';
  • set character_set_client = 'latin1';
  • set character_set_connection = 'latin1';

character_set_system is a read-only variable and cannot be changed at runtime. Changing the four above variables had no observable effect on the issue.