Reference Article
http://logzgh.itpub.net/post/3185/467401
When using Oracle in the past, collation methods were rarely a concern. However, in MySQL, ignoring collation can lead to problems.
Problem Description
A test table uses the latin1 character set.
SELECT to_id FROM test WHERE to_id='cn象_王';
Result:
+---------------+
| to_id |
+---------------+
| cn陶_陶 |
| cn象_王 |
+---------------+
2 rows in set (0.00 sec)
Querying for cn象_王 also retrieves cn陶_陶, which is clearly incorrect.
Checking Encoding
SELECT HEX('cn陶_陶');
Result:
+----------------+
| hex('cn陶_陶') |
+----------------+
| 636ECCD55FCCD5 |
+----------------+
1 row in set (0.00 sec)
SELECT HEX('cn象_王');
Result:
+----------------+
| hex('cn象_王') |
+----------------+
| 636ECFF35FCDF5 |
+----------------+
1 row in set (0.00 sec)
The encodings are indeed different, so why does MySQL consider these two records identical?
Identifying the Issue
The issue was traced to collation settings.
SHOW VARIABLES LIKE 'collation%';
Result:
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
Manually changing these parameters to latin1_bin still produced the same result, which was puzzling.
MySQL Collation Naming Rules
MySQL collation names follow these rules:
- They start with the associated character set name, often include a language name, and end with
_ci(case-insensitive),_cs(case-sensitive), or_bin(binary).
For example, the latin1 character set has the following collations:
| Collation | Meaning |
|---|---|
| latin1_german1_ci | German DIN-1 |
| latin1_swedish_ci | Swedish/Finnish |
| latin1_danish_ci | Danish/Norwegian |
| latin1_german2_ci | German DIN-2 |
| latin1_bin | Binary for latin1 |
| latin1_general_ci | Multilingual (Western Europe) |
| latin1_general_cs | Multilingual (Western Europe ISO), case-sensitive |
| latin1_spanish_ci | Modern Spanish |
Solution
Rebuilding the table and explicitly setting the table-level collation to latin1_bin resolved the issue.
ALTER TABLE db_allot CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
Recommendation
It is recommended to use the _bin type collation corresponding to the character set to avoid errors.
Personal Insights
The character set latin1 collate latin1_bin is an improvement over the old VARCHAR BINARY. The new version first defines the character set using character set and then specifies the collation as binary using _bin, ensuring correct Chinese queries.
Testing
Changing a field attribute to remove BINARY:
ALTER TABLE `comment_content_1_01` CHANGE `thread` `thread` VARCHAR(50) DEFAULT NULL;
Table structure becomes:
thread varchar(50) default NULL
This shows that character set latin1 collate latin1_bin is an improvement over the old VARCHAR BINARY.
Additionally, table-level collation can be set to latin1_bin without modifying individual field attributes.
Example
ALTER TABLE comment_content_1_01 CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
Exported table structure:
CREATE TABLE comment_content_1_01 (
content_id int(11) NOT NULL auto_increment,
thread varchar(50) collate latin1_bin default NULL,
uname varchar(100) collate latin1_bin default NULL,
nick varchar(100) collate latin1_bin default NULL,
uid int(11) unsigned default NULL,
content text collate latin1_bin,
post_time datetime default NULL,
post_ip int(10) unsigned default NULL,
`status` enum('unaudit','normal','deleted') collate latin1_bin NOT NULL default 'unaudit',
PRIMARY KEY (content_id)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_bin;
Even if individual fields were not previously defined with collate, they are now all collate latin1_bin.
Original Link: https://www.snowpeak.fun/cn/article/detail/collation_in_mysql/