Difference between revisions of "User:Alvonruff/The Charset Problem"

Revision as of 09:21, 29 April 2022

XXX

  • Browser - The browser follows the charset declared in the HTTP Content-Type header and in the <meta> tag. This definitely affects how the text is displayed; changing it was one of the first hacks attempted on isfdb2.
  • Apache - Apache now has a configurable charset. This defaults to utf-8, based on this entry in the config file: AddDefaultCharset UTF-8
  • ISFDB Scripts - Whatever is stored in the UNICODE variable in localdefs.py, which is currently ISO-8859-1 (latin1)
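  • Python2.7 - str values are plain byte strings; the interpreter's default encoding for implicit str/unicode conversion is ASCII, not UTF-8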
  • MySQLdb - ??
  • MySQL - Set to latin1
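
To check what the Python layer in the list above reports on a given host, a small diagnostic along these lines could be run on both isfdb1 and isfdb2. This is only a sketch: report_encodings is a hypothetical helper, not part of the ISFDB code, and it uses standard-library calls only.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings the interpreter reports (hypothetical helper)."""
    return {
        # Encoding used for implicit str <-> unicode conversions
        'default': sys.getdefaultencoding(),
        # Encoding used for file names
        'filesystem': sys.getfilesystemencoding(),
        # Encoding of stdout; can be None when output is piped
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding suggested by the locale settings of the process
        'locale': locale.getpreferredencoding(False),
    }


if __name__ == '__main__':
    for name, value in sorted(report_encodings().items()):
        print('%-10s %s' % (name, value))
```

Comparing the output from the two hosts would show whether the interpreter environment differs between them, independently of Apache and MySQL.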

Apache

Newer versions of Apache have the following default configuration parameter:

AddDefaultCharset UTF-8


ISFDB

There are two charset settings in common/isfdb.py.

The first is the HTTP Content-type response header (sent ahead of the page, so not visible in the HTML source):

print 'Content-type: text/html; charset=%s\n' % UNICODE

Additionally, a <meta> tag is emitted in the HTML itself:

print '<meta http-equiv="content-type" content="text/html; charset=%s" >' % UNICODE

The variable UNICODE is set to "ISO-8859-1" on isfdb1. Setting it to "UTF-8" fixes the output problem on isfdb2, but not the editing problem.
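
The header and the <meta> tag only label the bytes that follow; they do not convert them. A minimal sketch of that relationship, assuming a hypothetical helper build_page that is not part of common/isfdb.py:

```python
def build_page(text, charset):
    """Return (header, body_bytes) using one charset for both.

    Hypothetical helper for illustration only.
    """
    header = 'Content-type: text/html; charset=%s\n' % charset
    # The body must really be encoded in the charset the header declares
    body = text.encode(charset)
    return header, body


name = u'Philip Jos\xe9 Farmer'  # \xe9 is e-acute

latin1_header, latin1_body = build_page(name, 'ISO-8859-1')
utf8_header, utf8_body = build_page(name, 'UTF-8')

# ISO-8859-1 encodes e-acute as one byte, UTF-8 as two
print(b'\xe9' in latin1_body)
print(b'\xc3\xa9' in utf8_body)
```

If the bytes coming out of the database are already UTF-8 while the header still says ISO-8859-1, the browser shows two characters in place of the accented one, which is consistent with changing UNICODE fixing the output on isfdb2.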


Python2.7

If we write a short test program as follows:

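	import sys
	import MySQLdb
	# The ISFDB names used below are assumed to come from the ISFDB common modules.

	db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup())
	db.select_db(DBASE)

	# Look up the author whose ID is given on the command line
	authorID = int(sys.argv[1])
	authorName = SQLgetAuthorName(authorID)
	print authorName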

Then we get a result that, at first glance, looks correct. For instance, with an authorID of 26, this prints Philip José Farmer. However, if I add the following code:

	print len(authorName)
	print type(authorName)

It reports a length of 19, although the name has only 18 characters, and a type of str. The extra unit is a byte: a str holds bytes, and é occupies two bytes in UTF-8:

	19
	<type 'str'>

If we decode the string as UTF-8, using: unicodeValue = value.decode("utf-8", "strict"), then the same output would be:

	Philip José Farmer
	18
	<type 'unicode'>

If we instead decode the string as Latin-1, using: unicodeValue = value.decode("latin-1", "strict"), then the output would be:

	Philip José Farmer
	19
	<type 'unicode'>
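
The lengths above can be reproduced without a database. The sketch below assumes the value arrives as UTF-8 bytes (é as the two bytes C3 A9); the literal is an illustration, not data read from ISFDB.

```python
# Assumed byte string as returned by the database layer (UTF-8 encoded)
raw = b'Philip Jos\xc3\xa9 Farmer'

as_utf8 = raw.decode('utf-8', 'strict')
as_latin1 = raw.decode('latin-1', 'strict')

print(len(raw))        # 19: number of bytes
print(len(as_utf8))    # 18: characters when the bytes are read as UTF-8
print(len(as_latin1))  # 19: latin-1 maps every byte to one character

# Decoding as latin-1 never raises, but e-acute becomes two characters
print(as_utf8 == u'Philip Jos\xe9 Farmer')
print(as_latin1 == u'Philip Jos\xc3\xa9 Farmer')
```

Only the UTF-8 decode recovers the 18-character name; the Latin-1 decode succeeds silently and leaves the length at 19, so it cannot be used to detect the mismatch.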


MySQLdb

The Connection() function takes two optional arguments, use_unicode and charset (these only work on MySQL-4.1 and newer).

import MySQLdb

# charset sets the connection character set; use_unicode returns unicode objects
conn = MySQLdb.connect(host='127.0.0.1',
                       user='user',
                       passwd='passwd',
                       db='db',
                       charset='utf8',
                       use_unicode=True)

These arguments are not used on ISFDB1, so we will also leave them out on ISFDB2 for now.


MySQL

The current character set of the ISFDB MySQL database is latin1 (ISO-8859-1):

mysql> select default_character_set_name, default_collation_name from information_schema.schemata where schema_name='isfdb';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| latin1                     | latin1_swedish_ci      |
+----------------------------+------------------------+

That said, there are other MySQL charset variables to look at. On ISFDB1, we have:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

While on ISFDB2, MySQL defaulted these variables to:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8mb3                    |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

These variables can be set from the mysql command-line client by issuing the following commands:

  • set character_set_results = 'latin1';
  • set character_set_server = 'latin1';
  • set character_set_client = 'latin1';
  • set character_set_connection = 'latin1';

character_set_system is a read-only variable and cannot be changed at runtime. Changing the four variables above had no observable effect on the issue.
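
What character_set_results is meant to control can be imitated in plain Python. The sketch below is an assumption-laden model, not MySQL itself: transcode is a hypothetical helper, Python codec names stand in for the MySQL charsets, and the stored value is assumed to hold é as the single latin1 byte E9.

```python
def transcode(stored_bytes, column_charset, results_charset):
    """Imitate the server converting column bytes to character_set_results.

    Hypothetical helper; Python codecs stand in for MySQL charsets.
    """
    return stored_bytes.decode(column_charset).encode(results_charset)


# Assumption: the latin1 column stores e-acute as the single byte E9
stored = b'Philip Jos\xe9 Farmer'

isfdb1_style = transcode(stored, 'latin-1', 'latin-1')  # results = latin1
isfdb2_style = transcode(stored, 'latin-1', 'utf-8')    # results = utf8mb4

print(len(isfdb1_style))  # 18 bytes reach the client
print(len(isfdb2_style))  # 19 bytes reach the client
```

Under these assumptions the utf8mb4 setting would account for the 19-byte str seen in the Python2.7 test. Note also that a SET statement without GLOBAL changes only the session in which it is issued, so commands typed into the mysql client would not be expected to change connections opened separately by the ISFDB scripts.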