Difference between revisions of "User:Alvonruff/The Charset Problem"
Line 1: | Line 1: | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
==Apache== | ==Apache== |
Revision as of 09:22, 29 April 2022
Apache
Newer versions of Apache have the following default configuration parameter:
AddDefaultCharset UTF-8
ISBDB
There are two charset configs in common/isfdb.py:
There is an HTML session header (not observable in the HTML header):
print 'Content-type: text/html; charset=%s\n' % UNICODE
Additionally a <meta> tag is issued in the HTML flow:
print '<meta http-equiv="content-type" content="text/html; charset=%s" >' % UNICODE
The variable UNICODE is set to "ISO-8859-1" on isfdb1. Setting it to "UTF-8" fixes the output problem on isfdb2, but not the editing problem.
Python2.7
If we write a short test program as follows:
db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup()) db.select_db(DBASE) authorID = int(sys.argv[1]) authorName = SQLgetAuthorName(authorID) print authorName
Then we get a result, that at first glance, looks like the correct output. So for instance, if I use an authorID of 26, this prints out Philip José Farmer. However, if I add the following code:
print len(authorName) print type(authorName)
It outputs 19 for the length, while the actual length of his name is 18, while outputs a type of str:
19 <type 'str'>
If we convert the string to UTF-8, using: unicodeValue = value.decode("utf-8", "strict"), then the same output would be:
Philip José Farmer 18 <type 'unicode'>
If we convert the string to LATIN-1, using: unicodeValue = value.decode("latin-1", "strict"), then the output would be:
Philip José Farmer 19 <type 'unicode'>
MySQLdb
The Connection() function takes an optional arguments named use_unicode, and charset (these only work on MySQL-4.1 and newer).
conn = mysql.connect(host='127.0.0.1', user='user', passwd='passwd', db='db', charset='utf8', use_unicode=True)
These arguments are not used on ISFDB1, so we will also bypass these for now on ISFDB2.
MySQL
The current ISFDB character set of the MySQL database is latin1 (ISO-8859-1):
mysql> select default_character_set_name, default_collation_name from information_schema.schemata where schema_name='isfdb'; +----------------------------+------------------------+ | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | +----------------------------+------------------------+ | latin1 | latin1_swedish_ci | +----------------------------+------------------------+
That said, there are other MySQL charset variables to look at. On ISFDB1, we have:
mysql> show variables like '%character_set%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+
While on ISFDB2, MySQL defaulted these variables to:
mysql> show variables like '%character_set%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8mb4 | | character_set_connection | utf8mb4 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8mb4 | | character_set_server | utf8mb4 | | character_set_system | utf8mb3 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+
These variables can be set using the mysql app by issuing the following commands:
- set character_set_results = 'latin1';
- set character_set_server = 'latin1';
- set character_set_client = 'latin1';
- set character_set_connection = 'latin1';
character_set_system is a read-only variable and cannot be changed at runtime. Changing the four above variables had no observable effect on the issue.