This chapter covers issues of internationalization (MySQL's capabilities for adapting to local use) and localization (selecting particular local conventions):
MySQL support for character sets in SQL statements.
How to configure the server to support different character sets.
Selecting the language for error messages.
How to set the server's time zone and enable per-connection time zone support.
Selecting the locale for day and month names.
Improved support for character set handling was added to MySQL in
version 4.1. This support enables you to store data using a
variety of character sets and perform comparisons according to a
variety of collations. You can specify character sets at the
server, database, table, and column level. MySQL supports the use
of character sets for the MyISAM,
MEMORY, and (as of MySQL 4.1.2)
InnoDB storage engines. The
ISAM storage engine does not include character
set support; there are no plans to change this, because
ISAM is deprecated.
The NDBCluster storage engine in MySQL 4.1
(available beginning with MySQL 4.1.3-Max) provides limited
character set and collation support; see
Section 14.10, “Known Limitations of MySQL Cluster”.
This chapter discusses the following topics:
What are character sets and collations?
The multiple-level default system for character set assignment
Syntax for specifying character sets and collations
Affected functions and operations
Unicode support
The character sets and collations that are available, with notes
Character set issues affect not only data storage, but also
communication between client programs and the MySQL server. If you
want the client program to communicate with the server using a
character set different from the default, you'll need to indicate
which one. For example, to use the utf8 Unicode
character set, issue this statement after connecting to the
server:
SET NAMES 'utf8';
For more information about character set-related issues in client/server communication, see Section 9.1.4, “Connection Character Sets and Collations”.
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters:
“A”,
“B”,
“a”,
“b”. We give each letter a
number: “A” = 0,
“B” = 1,
“a” = 2,
“b” = 3. The letter
“A” is a symbol, the number 0 is
the encoding for
“A”, and the combination of all
four letters and their encodings is a
character set.
Suppose that we want to compare two string values,
“A” and
“B”. The simplest way to do this
is to look at the encodings: 0 for
“A” and 1 for
“B”. Because 0 is less than 1,
we say “A” is less than
“B”. What we've just done is
apply a collation to our character set. The collation is a set
of rules (only one rule in this case): “compare the
encodings.” We call this simplest of all possible
collations a binary collation.
But what if we want to say that the lowercase and uppercase
letters are equivalent? Then we would have at least two rules:
(1) treat the lowercase letters
“a” and
“b” as equivalent to
“A” and
“B”; (2) then compare the
encodings. We call this a
case-insensitive collation. It's a little
more complex than a binary collation.
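The toy alphabet above can be modeled in a few lines of Python. This is only an illustrative sketch: the names TOY_CHARSET, binary_key, and case_insensitive_key are invented for this example and say nothing about how MySQL implements collations internally.

```python
# Toy character set from the text: four symbols and their encodings.
TOY_CHARSET = {"A": 0, "B": 1, "a": 2, "b": 3}

def binary_key(s):
    """Binary collation: compare strings by their encodings only."""
    return [TOY_CHARSET[ch] for ch in s]

def case_insensitive_key(s):
    """Case-insensitive collation: (1) treat 'a'/'b' as 'A'/'B',
    (2) then compare the encodings."""
    return [TOY_CHARSET[ch.upper()] for ch in s]

# Binary collation: "A" < "B" because 0 < 1, and "a" differs from "A".
print(binary_key("A") < binary_key("B"))                        # True
print(binary_key("a") == binary_key("A"))                       # False
# Case-insensitive collation: "a" and "A" compare equal.
print(case_insensitive_key("a") == case_insensitive_key("A"))   # True
# Sorting by the binary collation follows the encodings 0, 1, 2, 3.
print(sorted(["b", "A", "a", "B"], key=binary_key))             # ['A', 'B', 'a', 'b']
```

The character set (the dictionary) stays the same in both cases; only the comparison rule changes, which is exactly the difference between a character set and a collation.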
In real life, most character sets have many characters: not just
“A” and
“B” but whole alphabets,
sometimes multiple alphabets or eastern writing systems with
thousands of characters, along with many special symbols and
punctuation marks. Also in real life, most collations have many
rules, not just for whether to distinguish lettercase, but also
for whether to distinguish accents (an “accent” is
a mark attached to a character as in German
“Ö”), and for
multiple-character mappings (such as the rule that
“Ö” =
“OE” in one of the two German
collations).
MySQL 4.1 can do these things for you:
Store strings using a variety of character sets
Compare strings using a variety of collations
Mix strings with different character sets or collations in the same server, the same database, or even the same table
Allow specification of character set and collation at any level
In these respects, not only is MySQL 4.1 far more flexible than MySQL 4.0, it also is far ahead of most other database management systems. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.
The MySQL server can support multiple character sets. To list
the available character sets, use the SHOW CHARACTER
SET statement. A partial listing follows. For more
complete information, see Section 9.1.10, “Character Sets and Collations That MySQL Supports”.
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
...
Any given character set always has at least one collation. It
may have several collations. To list the collations for a
character set, use the SHOW COLLATION
statement. For example, to see the collations for the
latin1 (cp1252 West European) character set,
use this statement to find those collation names that begin with
latin1:
mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci   | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci    | latin1  | 15 |         |          |       0 |
| latin1_german2_ci   | latin1  | 31 |         | Yes      |       2 |
| latin1_bin          | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci   | latin1  | 48 |         |          |       0 |
| latin1_general_cs   | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci   | latin1  | 94 |         |          |       0 |
+---------------------+---------+----+---------+----------+---------+
The latin1 collations have the following
meanings:
| Collation         | Meaning                                             |
| latin1_german1_ci | German DIN-1                                        |
| latin1_swedish_ci | Swedish/Finnish                                     |
| latin1_danish_ci  | Danish/Norwegian                                    |
| latin1_german2_ci | German DIN-2                                        |
| latin1_bin        | Binary according to latin1 encoding                 |
| latin1_general_ci | Multilingual (Western European)                     |
| latin1_general_cs | Multilingual (ISO Western European), case sensitive |
| latin1_spanish_ci | Modern Spanish                                      |
Collations have these general characteristics:
Two different character sets cannot have the same collation.
Each character set has one collation that is the
default collation. For example, the
default collation for latin1 is
latin1_swedish_ci. The output for
SHOW CHARACTER SET indicates which
collation is the default for each displayed character set.
There is a convention for collation names: They start with
the name of the character set with which they are
associated, they usually include a language name, and they
end with _ci (case insensitive),
_cs (case sensitive), or
_bin (binary).
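The naming convention can be made concrete with a small Python helper. This is a sketch under stated assumptions: parse_collation_name is a hypothetical function written for this illustration, and it assumes that the character set name itself contains no underscore, which holds for every name shown in this section.

```python
def parse_collation_name(name):
    """Split a collation name according to the naming convention.

    Returns (charset, language_or_None, suffix), where suffix is
    'ci', 'cs', or 'bin'.
    """
    parts = name.split("_")
    suffix = parts[-1]
    if suffix not in ("ci", "cs", "bin"):
        raise ValueError("unexpected collation suffix: %r" % suffix)
    charset = parts[0]
    # Whatever sits between the charset and the suffix is usually a language.
    language = "_".join(parts[1:-1]) or None
    return charset, language, suffix

print(parse_collation_name("latin1_swedish_ci"))   # ('latin1', 'swedish', 'ci')
print(parse_collation_name("latin1_bin"))          # ('latin1', None, 'bin')
```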
In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.
There are default settings for character sets and collations at four levels: server, database, table, and column. The description in the following sections may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.
CHARACTER SET is used in clauses that specify
a character set. CHARSET may be used as a
synonym for CHARACTER SET.
MySQL Server has a server character set and a server collation. These can be set at server startup on the command line or in an option file and changed at runtime.
Initially, the server character set and collation depend on
the options that you use when you start
mysqld. You can use
--character-set-server for the character set.
Along with it, you can add --collation-server
for the collation. If you don't specify a character set, that
is the same as saying
--character-set-server=latin1. If you specify
only a character set (for example, latin1)
but not a collation, that is the same as saying
--character-set-server=latin1
--collation-server=latin1_swedish_ci because
latin1_swedish_ci is the default collation
for latin1. Therefore, the following three
commands all have the same effect:
shell> mysqld
shell> mysqld --character-set-server=latin1
shell> mysqld --character-set-server=latin1 \
           --collation-server=latin1_swedish_ci
One way to change the settings is by recompiling. To change the
default server character set and collation when building from
source, pass --with-charset and --with-collation as arguments to
configure. For example:
shell> ./configure --with-charset=latin1
Or:
shell> ./configure --with-charset=latin1 \
           --with-collation=latin1_german1_ci
Both mysqld and configure verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.
The server character set and collation are used as default
values if the database character set and collation are not
specified in CREATE DATABASE statements.
They have no other purpose.
The current server character set and collation can be
determined from the values of the
character_set_server and
collation_server system variables. These
variables can be changed at runtime.
Every database has a database character set and a database
collation. The CREATE DATABASE and
ALTER DATABASE statements have optional
clauses for specifying the database character set and
collation:
CREATE DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

ALTER DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]
All database options are stored in a text file named
db.opt that can be found in the database
directory.
The CHARACTER SET and
COLLATE clauses make it possible to create
databases with different character sets and collations on the
same MySQL server.
Example:
CREATE DATABASE db_name CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the database character set and database collation in the following manner:
If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.
If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.
If COLLATE Y was specified without CHARACTER SET, then the character set associated with Y and collation Y.
Otherwise, the server character set and server collation.
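The four rules above can be expressed as a short Python function. This is a minimal sketch, not MySQL's implementation: resolve, charset_of, and the DEFAULT_COLLATION table are hypothetical names, and the table holds only defaults listed earlier in this chapter. The validity check mirrors the statement that an invalid character set/collation combination is rejected.

```python
# Hypothetical lookup table with only the defaults shown in this chapter.
DEFAULT_COLLATION = {"latin1": "latin1_swedish_ci",
                     "latin2": "latin2_general_ci"}

def charset_of(collation):
    """By the naming convention, a collation name starts with its charset."""
    return collation.split("_")[0]

def resolve(charset=None, collation=None,
            inherited=("latin1", "latin1_swedish_ci")):
    """Return the (charset, collation) pair chosen for a new object.

    `inherited` is the pair taken from the enclosing level; for
    CREATE DATABASE that is the server character set and collation.
    """
    if charset and collation:
        if charset_of(collation) != charset:
            raise ValueError("collation %s is not valid for %s"
                             % (collation, charset))
        return charset, collation
    if charset:
        return charset, DEFAULT_COLLATION[charset]
    if collation:
        return charset_of(collation), collation
    return inherited

print(resolve("latin1", "latin1_german1_ci"))   # ('latin1', 'latin1_german1_ci')
print(resolve(charset="latin2"))                # ('latin2', 'latin2_general_ci')
print(resolve(collation="latin1_danish_ci"))    # ('latin1', 'latin1_danish_ci')
print(resolve())                                # ('latin1', 'latin1_swedish_ci')
```

Note the second call: naming only a character set selects that character set's own default collation, never the collation of the enclosing level.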
The database character set and collation are used as default
values if the table character set and collation are not
specified in CREATE TABLE statements. They
have no other purpose.
The character set and collation for the default database can
be determined from the values of the
character_set_database and
collation_database system variables. The
server sets these variables whenever the default database
changes. If there is no default database, the variables have
the same value as the corresponding server-level system
variables, character_set_server and
collation_server.
Every table has a table character set and a table collation.
The CREATE TABLE and ALTER
TABLE statements have optional clauses for
specifying the table character set and collation:
CREATE TABLE tbl_name (column_list)
    [[DEFAULT] CHARACTER SET charset_name]
    [COLLATE collation_name]

ALTER TABLE tbl_name
    [[DEFAULT] CHARACTER SET charset_name]
    [COLLATE collation_name]
Example:
CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;
MySQL chooses the table character set and collation in the following manner:
If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.
If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.
If COLLATE Y was specified without CHARACTER SET, then the character set associated with Y and collation Y.
Otherwise, the database character set and collation.
The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
Every “character” column (that is, a column of
type CHAR, VARCHAR, or
TEXT) has a column character set and a
column collation. Column definition syntax for CREATE
TABLE and ALTER TABLE has
optional clauses for specifying the column character set and
collation:
col_name {CHAR | VARCHAR | TEXT} (col_length)
    [CHARACTER SET charset_name]
    [COLLATE collation_name]
Examples:
CREATE TABLE Table1
(
column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);
ALTER TABLE Table1 MODIFY
column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_swedish_ci;
If you convert a column from one character set to another,
ALTER TABLE attempts to map the data
values, but if the character sets are incompatible, there may
be data loss.
MySQL chooses the column character set and collation in the following manner:
If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y are used.
If CHARACTER SET X was specified without COLLATE, then character set X and its default collation are used.
If COLLATE Y was specified without CHARACTER SET, then the character set associated with Y and collation Y are used.
Otherwise, the table character set and collation are used.
The CHARACTER SET and
COLLATE clauses are standard SQL.
Every character string literal has a character set and a collation.
A character string literal may have an optional character set
introducer and COLLATE clause:
[_charset_name]'string' [COLLATE collation_name]
Examples:
SELECT 'string';
SELECT _latin1'string';
SELECT _latin1'string' COLLATE latin1_danish_ci;
For the simple statement SELECT 'string', the string has the
character set and collation defined by the
character_set_connection and collation_connection system
variables.
The _charset_name expression is formally called an
introducer. It tells the parser, “the string that is about to
follow uses character set X.” Because this has confused people
in the past, we emphasize that an introducer does not change
the string to the introducer character set like CONVERT() would
do. It does not change the string's value, although padding may
occur. The introducer is just a signal. An introducer is also
legal before standard hex literal and numeric hex literal
notation (x'literal' and 0xnnnn).
Examples:
SELECT _latin1 x'AABBCC';
SELECT _latin1 0xAABBCC;
MySQL determines a literal's character set and collation in the following manner:
If both _X and COLLATE Y were specified, then character set X
and collation Y are used.
If _X is specified but COLLATE is not specified, then character
set X and its default collation are used.
Otherwise, the character set and collation given by the
character_set_connection and
collation_connection system variables
are used.
Examples:
A string with latin1 character set and
latin1_german1_ci collation:
SELECT _latin1'Müller' COLLATE latin1_german1_ci;
A string with latin1 character set and
its default collation (that is,
latin1_swedish_ci):
SELECT _latin1'Müller';
A string with the connection default character set and collation:
SELECT 'Müller';
Character set introducers and the COLLATE
clause are implemented according to standard SQL
specifications.
An introducer indicates the character set for the following
string, but does not change how the parser performs escape
processing within the string. Escapes are always interpreted
by the parser according to the character set given by
character_set_connection.
The following examples show that escape processing occurs
using character_set_connection even in the
presence of an introducer. The examples use SET
NAMES (which changes
character_set_connection, as discussed in
Section 9.1.4, “Connection Character Sets and Collations”), and display the
resulting strings using the
HEX() function so that the
exact string contents can be seen.
Example 1:
mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.01 sec)

mysql> SELECT HEX('à\n'), HEX(_sjis'à\n');
+------------+-----------------+
| HEX('à\n') | HEX(_sjis'à\n') |
+------------+-----------------+
| E00A       | E00A            |
+------------+-----------------+
1 row in set (0.00 sec)
Here, “à” (hex value
E0) is followed by
“\n”, the escape sequence for
newline. The escape sequence is interpreted using the
character_set_connection value of
latin1 to produce a literal newline (hex
value 0A). This happens even for the second
string. That is, the introducer of _sjis
does not affect the parser's escape processing.
Example 2:
mysql> SET NAMES sjis;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT HEX('à\n'), HEX(_latin1'à\n');
+------------+-------------------+
| HEX('à\n') | HEX(_latin1'à\n') |
+------------+-------------------+
| E05C6E     | E05C6E            |
+------------+-------------------+
1 row in set (0.04 sec)
Here, character_set_connection is
sjis, a character set in which the sequence
of “à” followed by
“\” (hex values
E0 and 5C) is a valid
multi-byte character. Hence, the first two bytes of the string
are interpreted as a single sjis character,
and the “\” is not interpreted
as an escape character. The following
“n” (hex value
6E) is not interpreted as part of an escape
sequence. This is true even for the second string; the
introducer of _latin1 does not affect
escape processing.
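The byte-level reason for the difference between the two examples can be checked outside MySQL. The following Python sketch decodes the same three bytes with Python's own latin-1 and shift_jis codecs, which are assumed here to stand in for the MySQL latin1 and sjis character sets; it illustrates the encodings only, not the server's parser.

```python
# Bytes the client sends for the literal text: 0xE0 ('à' in latin1),
# 0x5C (backslash), 0x6E ('n').
raw = b"\xe0\x5c\x6e"

# Read as latin1, every byte is one character, so the backslash stands
# alone and can begin the escape sequence \n.
as_latin1 = raw.decode("latin-1")
print(len(as_latin1), as_latin1[1] == "\\")   # 3 True

# Read as Shift-JIS, 0xE0 is a lead byte and 0x5C is its trail byte:
# together they form one two-byte character, leaving 'n' on its own.
as_sjis = raw.decode("shift_jis")
print(len(as_sjis), as_sjis[1] == "n")        # 2 True
```

Because the backslash is consumed as the second half of a multi-byte character under sjis, there is no escape sequence left to interpret, which matches the E05C6E result in Example 2.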
Before MySQL 4.1, NCHAR and
CHAR were synonymous. Standard SQL defines
NCHAR or NATIONAL CHAR
as a way to indicate that a CHAR column
should use some predefined character set. MySQL 4.1 and up
uses utf8 as that predefined character set.
For example, these data type declarations are equivalent:
CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)
As are these:
VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)
You can use N'literal' (or n'literal') to create a string in
the national character set. These statements are equivalent:
SELECT N'some text';
SELECT n'some text';
SELECT _utf8'some text';
The following examples show how MySQL determines default character set and collation values.
Example 1: Table and Column Definition
CREATE TABLE t1
(
c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
Here we have a column with a latin1
character set and a latin1_german1_ci
collation. The definition is explicit, so that's
straightforward. Notice that there is no problem with storing
a latin1 column in a
latin2 table.
Example 2: Table and Column Definition
CREATE TABLE t1
(
c1 CHAR(10) CHARACTER SET latin1
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
This time we have a column with a latin1
character set and a default collation. Although it might seem
natural, the default collation is not taken from the table
level. Instead, because the default collation for
latin1 is always
latin1_swedish_ci, column
c1 has a collation of
latin1_swedish_ci (not
latin1_danish_ci).
Example 3: Table and Column Definition
CREATE TABLE t1
(
c1 CHAR(10)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
We have a column with a default character set and a default
collation. In this circumstance, MySQL checks the table level
to determine the column character set and collation.
Consequently, the character set for column
c1 is latin1 and its
collation is latin1_danish_ci.
Example 4: Database, Table, and Column Definition
CREATE DATABASE d1
DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
c1 CHAR(10)
);
We create a column without specifying its character set and
collation. We're also not specifying a character set and a
collation at the table level. In this circumstance, MySQL
checks the database level to determine the table settings,
which thereafter become the column settings. Consequently,
the character set for column c1 is
latin2 and its collation is
latin2_czech_ci.
Several character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections:
The server character set and collation can be determined
from the values of the
character_set_server and
collation_server system variables.
The character set and collation of the default database can
be determined from the values of the
character_set_database and
collation_database system variables.
Additional character set and collation system variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation system variables.
Consider what a “connection” is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:
What character set is the statement in when it leaves the client?
The server takes the character_set_client
system variable to be the character set in which statements
are sent by the client.
What character set should the server translate a statement to after receiving it?
For this, the server uses the
character_set_connection and
collation_connection system variables. It
converts statements sent by the client from
character_set_client to
character_set_connection (except for
string literals that have an introducer such as
_latin1 or _utf8).
collation_connection is important for
comparisons of literal strings. For comparisons of strings
with column values, collation_connection
does not matter because columns have their own collation,
which has a higher collation precedence.
What character set should the server translate to before shipping result sets or error messages back to the client?
The character_set_results system variable
indicates the character set in which the server returns
query results to the client. This includes result data such
as column values, and result metadata such as column names.
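The conversion steps described by these three variables can be sketched in Python. This is an illustrative model only: transcode and the session dictionary are hypothetical, and the Python codec names cp1252 and utf-8 are assumed stand-ins for the MySQL character sets latin1 and utf8.

```python
def transcode(data, source, target):
    """Re-encode bytes from one character set to another."""
    return data.decode(source).encode(target)

# Hypothetical session settings for this sketch.
session = {
    "character_set_client": "cp1252",
    "character_set_connection": "utf-8",
    "character_set_results": "cp1252",
}

# Statement text as the client sends it ('Müller' in latin1: ü is 0xFC).
sent = b"M\xfcller"

# The server converts it from character_set_client to
# character_set_connection.
received = transcode(sent, session["character_set_client"],
                     session["character_set_connection"])
print(received)   # b'M\xc3\xbcller'

# A value held in the connection character set (such as an echoed
# literal) is converted to character_set_results on the way back.
returned = transcode(received, session["character_set_connection"],
                     session["character_set_results"])
print(returned)   # b'M\xfcller'
```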
You can fine-tune the settings for these variables, or you can depend on the defaults (in which case, you can skip the rest of this section).
There are two statements that affect the connection character sets:
SET NAMES 'charset_name'
SET CHARACTER SET charset_name
SET NAMES indicates what character set the
client will use to send SQL statements to the server. Thus,
SET NAMES 'cp1251' tells the server
“future incoming messages from this client are in
character set cp1251.” It also
specifies the character set that the server should use for
sending results back to the client. (For example, it indicates
what character set to use for column values if you use a
SELECT statement.)
A SET NAMES 'x' statement is equivalent to these three statements:
SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;
Setting character_set_connection to
x also sets
collation_connection to the default collation
for x. It is not necessary to set
that collation explicitly. To specify a particular collation for
the character sets, use the optional COLLATE
clause:
SET NAMES 'charset_name' COLLATE 'collation_name'
SET CHARACTER SET is similar to SET NAMES but sets
character_set_connection and collation_connection to
character_set_database and collation_database. A SET CHARACTER
SET x statement is equivalent to these three statements:
SET character_set_client = x;
SET character_set_results = x;
SET collation_connection = @@collation_database;
Setting collation_connection also sets
character_set_connection to the character set
associated with the collation (equivalent to executing
SET character_set_connection =
@@character_set_database). It is not necessary to set
character_set_connection explicitly.
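The difference between the two statements is easiest to see when the session variables are modeled as a dictionary. The Python sketch below is a hypothetical model, not server code: set_names, set_character_set, and the small DEFAULT_COLLATION table are invented for the illustration and contain only values shown in this chapter.

```python
# Hypothetical default-collation table, limited to values from this chapter.
DEFAULT_COLLATION = {"latin1": "latin1_swedish_ci",
                     "koi8r": "koi8r_general_ci"}

def set_names(session, charset, collation=None):
    """Model of SET NAMES 'charset' [COLLATE 'collation']."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    session["character_set_connection"] = charset
    session["collation_connection"] = collation or DEFAULT_COLLATION[charset]

def set_character_set(session, charset):
    """Model of SET CHARACTER SET charset."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    # The connection settings follow the default database, not `charset`.
    session["collation_connection"] = session["collation_database"]
    session["character_set_connection"] = session["character_set_database"]

session = {"character_set_database": "latin1",
           "collation_database": "latin1_swedish_ci"}

set_names(session, "koi8r")
print(session["character_set_connection"], session["collation_connection"])
# koi8r koi8r_general_ci

set_character_set(session, "koi8r")
print(session["character_set_client"], session["character_set_connection"])
# koi8r latin1
```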
When a client connects, it sends to the server the name of the
character set that it wants to use. The server uses the name to
set the character_set_client,
character_set_results, and
character_set_connection system variables. In
effect, the server performs a SET NAMES
operation using the character set name.
With the mysql client, it is not necessary to
execute SET NAMES every time you start up if
you want to use a character set different from the default. You
can add the --default-character-set option
to your mysql command line, or in
your option file. For example, the following option file setting
sets the three character set variables to
koi8r each time you invoke
mysql:
[mysql]
default-character-set=koi8r
Example: Suppose that column1 is defined as
CHAR(5) CHARACTER SET latin2. If you do not
say SET NAMES or SET CHARACTER
SET, then for SELECT column1 FROM
t, the server sends back all the values for
column1 using the character set that the
client specified when it connected. On the other hand, if you
say SET NAMES 'latin1' or SET
CHARACTER SET latin1 before issuing the
SELECT statement, the server converts the
latin2 values to latin1
just before sending results back. Conversion may be lossy if
there are characters that are not in both character sets.
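Lossy conversion is easy to reproduce with Python's codecs. In this sketch the iso8859_2 and cp1252 codecs are assumed stand-ins for the MySQL latin2 and latin1 character sets, and the choice of '?' as the replacement (errors="replace") is this example's own; the exact substitute the server emits is not specified in this section.

```python
# The Polish city name "Lodz" with its diacritics, written with escapes:
# U+0141 and U+017A exist in latin2 but have no latin1 equivalent.
text = "\u0141\u00f3d\u017a"

# Bytes as they would be held in a latin2 column.
stored = text.encode("iso8859_2")
print(stored.decode("iso8859_2") == text)   # True (round-trips within latin2)

# Converting to latin1 for the client: characters with no latin1
# equivalent cannot survive, so they are replaced here.
sent = stored.decode("iso8859_2").encode("cp1252", errors="replace")
print(sent)                                 # b'?\xf3d?'
```

Only the two characters shared by both character sets (ó and d) come through unchanged.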
If you do not want the server to perform any conversion of
result sets, set character_set_results to
NULL:
SET character_set_results = NULL;
ucs2 cannot be used as a client character
set, which means that it does not work for SET
NAMES or SET CHARACTER SET.
To see the values of the character set and collation system variables that apply to your connection, use these statements:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
You must also consider the environment within which your MySQL application executes. For example, if you will send statements using UTF-8 text taken from a file that you create in an editor, you should edit the file with the locale of your environment set to UTF-8 so that the file's encoding is correct and so that the operating system handles it correctly. For a script that executes in a Web environment, the script must handle the character encoding properly for its interaction with the MySQL server, and it must generate pages that correctly indicate the encoding so that browsers know how to display the content of the pages.
The following sections discuss various aspects of character set collations.
With the COLLATE clause, you can override
whatever the default collation is for a comparison.
COLLATE may be used in various parts of SQL
statements. Here are some examples:
With ORDER BY:
SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
With AS:
SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
With GROUP BY:
SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
With DISTINCT:
SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
With WHERE:
SELECT *
FROM t1
WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
SELECT *
FROM t1
WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
With HAVING:
SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
The COLLATE clause has high precedence
(higher than ||), so the
following two expressions are equivalent:
x || y COLLATE z
x || (y COLLATE z)
The BINARY operator casts the string
following it to a binary string. This is an easy way to force
a comparison to be done byte by byte rather than character by
character. BINARY also causes trailing
spaces to be significant.
mysql> SELECT 'a' = 'A';
        -> 1
mysql> SELECT BINARY 'a' = 'A';
        -> 0
mysql> SELECT 'a' = 'a ';
        -> 1
mysql> SELECT BINARY 'a' = 'a ';
        -> 0
BINARY str is shorthand for CAST(str AS BINARY).
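The four comparisons above can be mimicked with a rough Python model. The functions nonbinary_equal and binary_equal are hypothetical; the first approximates a case-insensitive collation by ignoring letter case and trailing spaces, which is a simplification of what a real collation does.

```python
def nonbinary_equal(a, b):
    """Rough model of '=' under a case-insensitive collation:
    letter case and trailing spaces are not significant."""
    return a.rstrip(" ").lower() == b.rstrip(" ").lower()

def binary_equal(a, b, charset="latin-1"):
    """Model of BINARY a = b: compare the encoded bytes one by one."""
    return a.encode(charset) == b.encode(charset)

print(nonbinary_equal("a", "A"))    # True  (SELECT 'a' = 'A'         -> 1)
print(binary_equal("a", "A"))       # False (SELECT BINARY 'a' = 'A'  -> 0)
print(nonbinary_equal("a", "a "))   # True  (SELECT 'a' = 'a '        -> 1)
print(binary_equal("a", "a "))      # False (SELECT BINARY 'a' = 'a ' -> 0)
```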
The BINARY attribute in character column
definitions has a different effect. A character column defined
with the BINARY attribute is assigned the
binary collation of the column's character set. Every
character set has a binary collation. For example, the binary
collation for the latin1 character set is
latin1_bin, so if the table default
character set is latin1, these two column
definitions are equivalent:
CHAR(10) BINARY
CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
The effect of BINARY as a column attribute
differs from its effect prior to MySQL 4.1. Formerly,
BINARY resulted in a column that was
treated as a binary string. A binary string is a string of
bytes that has no character set or collation, which differs
from a non-binary character string that has a binary
collation. For both types of strings, comparisons are based on
the numeric values of the string unit, but for non-binary
strings the unit is the character and some character sets
allow multi-byte characters.
See also Section 10.4.2, “The BINARY and VARBINARY Types”.
The use of CHARACTER SET binary in the
definition of a CHAR,
VARCHAR, or TEXT column
causes the column to be treated as a binary data type. For
example, the following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary
BINARY(10)

VARCHAR(10) CHARACTER SET binary
VARBINARY(10)

TEXT CHARACTER SET binary
BLOB
In the great majority of statements, it is obvious what
collation MySQL uses to resolve a comparison operation. For
example, in the following cases, it should be clear that the
collation is the collation of column x:
SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;
However, when multiple operands are involved, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should this query use the collation of the column
x, or of the string literal
'Y'?
Standard SQL resolves such questions using what used to be
called “coercibility” rules. Basically, this
means: Both x and 'Y'
have collations, so which collation takes precedence? This can
be difficult to resolve, but the following rules cover most
situations:
An explicit COLLATE clause has a
coercibility of 0. (Not coercible at all.)
The concatenation of two strings with different collations has a coercibility of 1.
A column's collation has a coercibility of 2.
A “system constant” (the string returned by
functions such as USER()
or VERSION()) has a
coercibility of 3.
A literal's collation has a coercibility of 4.
NULL or an expression that is derived
from NULL has a coercibility of 5.
The preceding coercibility values are current as of MySQL 4.1.11. See the note later in this section for additional version-related information.
Those rules resolve ambiguities in the following manner:
Use the collation with the lowest coercibility value.
If both sides have the same coercibility, then:
If both sides are Unicode, or both sides are not Unicode, it is an error.
If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement will not return an error:
SELECT CONCAT(utf8_column, latin1_column) FROM t1;
It will return a result, and the character set of the
result will be utf8. The collation
of the result will be the collation of
utf8_column. Values of
latin1_column will be automatically
converted to utf8 before
concatenating.
Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.
Examples:
column1 = 'A' | Use collation of column1 |
column1 = 'A' COLLATE x | Use collation of 'A' |
column1 COLLATE x = 'A' COLLATE y | Error |
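The resolution rules above can be sketched in Python. This is a simplified model of the rules as listed in this section, not the server's code. The names Operand and resolve_collation are hypothetical, the set of Unicode character sets is an assumption, and the sketch also assumes that two operands with the same collation never conflict.

```python
from dataclasses import dataclass

# Assumption for illustration: which character sets count as Unicode.
UNICODE_CHARSETS = {"utf8", "ucs2"}


@dataclass(frozen=True)
class Operand:
    collation: str      # for example "latin1_swedish_ci"
    coercibility: int   # 0 (explicit COLLATE) through 5 (NULL)

    @property
    def is_unicode(self):
        # Assumes the collation name starts with its character set name.
        return self.collation.split("_", 1)[0] in UNICODE_CHARSETS


def resolve_collation(left, right):
    """Pick the collation for a two-operand comparison."""
    # Rule 1: the lowest coercibility value wins.
    if left.coercibility != right.coercibility:
        return min(left, right, key=lambda op: op.coercibility).collation
    # Assumption: identical collations are not a conflict.
    if left.collation == right.collation:
        return left.collation
    # Rule 2: same coercibility, exactly one side Unicode: Unicode wins.
    if left.is_unicode != right.is_unicode:
        return (left if left.is_unicode else right).collation
    # Both Unicode or both non-Unicode: error.
    raise ValueError("Illegal mix of collations")


column1 = Operand("latin1_german1_ci", 2)   # a column
literal = Operand("latin1_swedish_ci", 4)   # a literal such as 'A'
print(resolve_collation(column1, literal))  # latin1_german1_ci
```

With these definitions, a literal carrying an explicit COLLATE clause (coercibility 0) overrides the column, and two explicit COLLATE clauses naming different latin1 collations raise the error, which matches the three example rows.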
The COERCIBILITY() function
can be used to determine the coercibility of a string
expression:
mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
        -> 0
mysql> SELECT COERCIBILITY(VERSION());
        -> 3
mysql> SELECT COERCIBILITY('A');
        -> 4
See Section 11.10.3, “Information Functions”.
Before MySQL 4.1.11, there is no system constant or ignorable
coercibility. Functions such as
USER() have a coercibility of
2 rather than 3, and literals have a coercibility of 3 rather
than 4.
Each character set has one or more collations, but each
collation is associated with one and only one character set.
Therefore, the following statement causes an error message
because the latin2_bin collation is not
legal with the latin1 character set:
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
In some cases, expressions that worked before MySQL 4.1 fail in early versions of MySQL 4.1 if you do not take character set and collation into account. For example, before 4.1, this statement works as is:
mysql> SELECT SUBSTRING_INDEX(USER(),'@',1);
+-------------------------------+
| SUBSTRING_INDEX(USER(),'@',1) |
+-------------------------------+
| root |
+-------------------------------+
The statement also works as is in MySQL 4.1 as of 4.1.8. In
MySQL 4.1, user names are stored using the
utf8 character set (see
Section 9.1.8, “UTF-8 for Metadata”). The literal string
'@' has the server character set
(latin1 by default). Although the character
sets are different, MySQL can coerce the
latin1 string to the character set (and
collation) of USER() without
data loss. It does so, performs the substring operation, and
returns a result that has a character set of
utf8.
However, in versions of MySQL 4.1 before 4.1.8, the statement fails:
mysql> SELECT SUBSTRING_INDEX(USER(),'@',1);
ERROR 1267 (HY000): Illegal mix of collations
(utf8_general_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE)
for operation 'substr_index'
This happens because the automatic character set conversion of
'@' does not occur and the string operands
have different character sets (and thus different collations):
mysql> SELECT COLLATION(USER()), COLLATION('@');
+-------------------+-------------------+
| COLLATION(USER()) | COLLATION('@') |
+-------------------+-------------------+
| utf8_general_ci | latin1_swedish_ci |
+-------------------+-------------------+
One way to deal with this is to upgrade to MySQL 4.1.8 or
later. If that is not possible, you can tell MySQL to
interpret the literal string as utf8:
mysql> SELECT SUBSTRING_INDEX(USER(),_utf8'@',1);
+------------------------------------+
| SUBSTRING_INDEX(USER(),_utf8'@',1) |
+------------------------------------+
| root |
+------------------------------------+
Another way is to change the connection character set and
collation to utf8. You can do that with
SET NAMES 'utf8' or by setting the
character_set_connection and
collation_connection system variables
directly.
Example 1: Sorting German Umlauts
Suppose that column X in table
T has these latin1
column values:
Muffler
Müller
MX Systems
MySQL
Suppose also that the column values are retrieved using the following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name;
The following table shows the resulting order of the values if
we use ORDER BY with different collations:
| latin1_swedish_ci | latin1_german1_ci | latin1_german2_ci |
| Muffler           | Muffler           | Müller            |
| MX Systems        | Müller            | Muffler           |
| Müller            | MX Systems        | MX Systems        |
| MySQL             | MySQL             | MySQL             |
The character that causes the different sort orders in this
example is the U with two dots over it
(ü), which the Germans call
“U-umlaut.”
The first column shows the result of the
SELECT using the Swedish/Finnish
collating rule, which says that U-umlaut sorts with Y.
The second column shows the result of the
SELECT using the German DIN-1 rule,
which says that U-umlaut sorts with U.
The third column shows the result of the
SELECT using the German DIN-2 rule,
which says that U-umlaut sorts with UE.
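The three orderings can be reproduced with a deliberately simplified Python model. It is a sketch under stated assumptions, not how MySQL implements collations: it handles only case folding and the U-umlaut, and the names U_UMLAUT_SORTS_AS and order_by are hypothetical.

```python
# Simplified model: each collation is reduced to case folding plus one
# rule for how the u-umlaut sorts. Real collations cover far more.
U_UMLAUT_SORTS_AS = {
    "latin1_swedish_ci": "y",   # Swedish/Finnish: U-umlaut sorts with Y
    "latin1_german1_ci": "u",   # German DIN-1: U-umlaut sorts with U
    "latin1_german2_ci": "ue",  # German DIN-2: U-umlaut sorts with UE
}

VALUES = ["Muffler", "M\u00fcller", "MX Systems", "MySQL"]


def order_by(values, collation_name):
    """Sort values the way ORDER BY ... COLLATE would in this toy model."""
    replacement = U_UMLAUT_SORTS_AS[collation_name]

    def sort_key(value):
        # Case-insensitive (_ci), then apply the umlaut rule.
        return value.lower().replace("\u00fc", replacement)

    return sorted(values, key=sort_key)


for name in U_UMLAUT_SORTS_AS:
    print(name, order_by(VALUES, name))
```

Each printed list matches the corresponding column of the table above.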
Example 2: Searching for German Umlauts
Suppose that you have three tables that differ only by the character set and collation used:
mysql> CREATE TABLE german1 (
    ->   c CHAR(10)
    -> ) CHARACTER SET latin1 COLLATE latin1_german1_ci;
mysql> CREATE TABLE german2 (
    ->   c CHAR(10)
    -> ) CHARACTER SET latin1 COLLATE latin1_german2_ci;
mysql> CREATE TABLE germanutf8 (
    ->   c CHAR(10)
    -> ) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Each table contains two records:
mysql> INSERT INTO german1 VALUES ('Bar'), ('Bär');
mysql> INSERT INTO german2 VALUES ('Bar'), ('Bär');
mysql> INSERT INTO germanutf8 VALUES ('Bar'), ('Bär');
Two of the above collations have an A = Ä
equality, and one has no such equality
(latin1_german2_ci). For that reason,
you'll get these results in comparisons:
mysql> SELECT * FROM german1 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bar  |
| Bär  |
+------+
mysql> SELECT * FROM german2 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bär  |
+------+
mysql> SELECT * FROM germanutf8 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bar  |
| Bär  |
+------+
This is not a bug but rather a consequence of the sorting
properties of latin1_german1_ci and
utf8_unicode_ci (the sorting shown is
done according to the German DIN 5007 standard).
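The search results can be modeled in the same simplified way. The Python sketch below is an illustration only: it assumes that latin1_german1_ci and utf8_unicode_ci compare Ä as A and that latin1_german2_ci compares Ä as AE, and the names A_UMLAUT_COMPARES_AS and select_equal are hypothetical.

```python
# Toy model of equality under each collation: case folding plus one
# rule for the a-umlaut. Not MySQL's actual comparison code.
A_UMLAUT_COMPARES_AS = {
    "latin1_german1_ci": "a",   # A = A-umlaut
    "latin1_german2_ci": "ae",  # A-umlaut = AE, so it differs from A
    "utf8_unicode_ci": "a",     # A = A-umlaut
}

ROWS = ["Bar", "B\u00e4r"]


def select_equal(rows, search_value, collation_name):
    """Return the rows that compare equal to search_value."""
    replacement = A_UMLAUT_COMPARES_AS[collation_name]

    def fold(value):
        return value.lower().replace("\u00e4", replacement)

    target = fold(search_value)
    return [row for row in rows if fold(row) == target]


for name in A_UMLAUT_COMPARES_AS:
    print(name, select_equal(ROWS, "B\u00e4r", name))
```

In this model, only the latin1_german2_ci search returns a single row, as in the query results shown earlier.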