Difference between revisions of "User:Alvonruff/The Charset Problem"
Line 7: | Line 7: | ||
* MySQLdb - ?? | * MySQLdb - ?? | ||
* MySQL - Set to latin1 | * MySQL - Set to latin1 | ||
+ | |||
+ | ==Apache== | ||
+ | |||
+ | Newer versions of Apache have the following default configuration parameter: | ||
+ | |||
+ | <pre> | ||
+ | AddDefaultCharset UTF-8 | ||
+ | </pre> | ||
+ | |||
+ | |||
+ | ==ISBDB== | ||
+ | |||
+ | There are two charset configs in ''common/isfdb.py'': | ||
+ | |||
+ | There is an HTML session header (not observable in the HTML header): | ||
+ | <pre> | ||
+ | print 'Content-type: text/html; charset=%s\n' % UNICODE | ||
+ | </pre> | ||
+ | |||
+ | Additionally a <meta> tag is issued in the HTML flow: | ||
+ | <pre> | ||
+ | print '<meta http-equiv="content-type" content="text/html; charset=%s" >' % UNICODE | ||
+ | </pre> | ||
+ | |||
+ | The variable UNICODE is set to "ISO-8859-1" on isfdb1. Setting it to "UTF-8" fixes the output problem on isfdb2, but not the editing problem. | ||
==Python2.7== | ==Python2.7== |
Revision as of 09:21, 29 April 2022
XXX
- Browser - The Browser just follows the html content-type indicator, as well as the <meta> tag. This definitely affects the appearance of the text, as this was one of the first hacks attempted at isfdb2.
- Apache - Apache now has a configurable charset. This defaults to utf-8, based on this entry in the config file: AddDefaultCharset UTF-8
- ISFDB Scripts - Whatever is stored in the UNICODE variable in localdefs.py, which is currently ISO-8859-1 (latin1)
- Python2.7 - Defaults to UTF-8
- MySQLdb - ??
- MySQL - Set to latin1
Apache
Newer versions of Apache have the following default configuration parameter:
AddDefaultCharset UTF-8
ISBDB
There are two charset configs in common/isfdb.py:
There is an HTML session header (not observable in the HTML header):
print 'Content-type: text/html; charset=%s\n' % UNICODE
Additionally a <meta> tag is issued in the HTML flow:
print '<meta http-equiv="content-type" content="text/html; charset=%s" >' % UNICODE
The variable UNICODE is set to "ISO-8859-1" on isfdb1. Setting it to "UTF-8" fixes the output problem on isfdb2, but not the editing problem.
Python2.7
If we write a short test program as follows:
db = MySQLdb.connect(DBASEHOST, USERNAME, PASSWORD, conv=IsfdbConvSetup()) db.select_db(DBASE) authorID = int(sys.argv[1]) authorName = SQLgetAuthorName(authorID) print authorName
Then we get a result, that at first glance, looks like the correct output. So for instance, if I use an authorID of 26, this prints out Philip José Farmer. However, if I add the following code:
print len(authorName) print type(authorName)
It outputs 19 for the length, while the actual length of his name is 18, while outputs a type of str:
19 <type 'str'>
If we convert the string to UTF-8, using: unicodeValue = value.decode("utf-8", "strict"), then the same output would be:
Philip José Farmer 18 <type 'unicode'>
If we convert the string to LATIN-1, using: unicodeValue = value.decode("latin-1", "strict"), then the output would be:
Philip José Farmer 19 <type 'unicode'>
MySQLdb
The Connection() function takes an optional arguments named use_unicode, and charset (these only work on MySQL-4.1 and newer).
conn = mysql.connect(host='127.0.0.1', user='user', passwd='passwd', db='db', charset='utf8', use_unicode=True)
These arguments are not used on ISFDB1, so we will also bypass these for now on ISFDB2.
MySQL
The current ISFDB character set of the MySQL database is latin1 (ISO-8859-1):
mysql> select default_character_set_name, default_collation_name from information_schema.schemata where schema_name='isfdb'; +----------------------------+------------------------+ | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | +----------------------------+------------------------+ | latin1 | latin1_swedish_ci | +----------------------------+------------------------+
That said, there are other MySQL charset variables to look at. On ISFDB1, we have:
mysql> show variables like '%character_set%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+
While on ISFDB2, MySQL defaulted these variables to:
mysql> show variables like '%character_set%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8mb4 | | character_set_connection | utf8mb4 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8mb4 | | character_set_server | utf8mb4 | | character_set_system | utf8mb3 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+
These variables can be set using the mysql app by issuing the following commands:
- set character_set_results = 'latin1';
- set character_set_server = 'latin1';
- set character_set_client = 'latin1';
- set character_set_connection = 'latin1';
character_set_system is a read-only variable and cannot be changed at runtime. Changing the four above variables had no observable effect on the issue.