MySQL Character Set Change Guide: Converting to utf8mb4 (Fixing Mojibake)

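1. Introduction

Why You Might Need to Change the MySQL Character Set

A database's character set is an important setting that determines how text data is encoded for storage and processing. In MySQL 5.7 and earlier, the default character set is latin1 (MySQL 8.0 changed the default to utf8mb4), which can cause problems when you work with Japanese or other non-Latin characters. Switching to an appropriate character set becomes especially important during data migration or system standardization.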

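Common Problems and Their Causes

Typical problems related to MySQL character sets include the following.

  1. Garbled text (mojibake)
    • utf8 and latin1 are mixed in the same environment
    • The client and server character set settings do not match
  2. Search problems
    • Collation differences prevent searches from returning the expected results
    • The sort order differs from what you expect
  3. Problems during data migration
    • Emoji and special symbols cannot be saved because utf8mb4 is not used
    • Character set conversion is not handled correctly during export/import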

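Goals and Structure of This Article

This article explains MySQL character set changes end to end, from the basic concepts to changing the settings and troubleshooting.

Outline

  1. MySQL character set basics
  2. How to check the current character set
  3. How to change the MySQL character set
  4. Troubleshooting after the change
  5. How character set changes affect performance
  6. Recommended settings (best practices)
  7. Frequently asked questions (FAQ)

By reading this guide, you will deepen your understanding of MySQL character sets, be able to choose the right settings, and avoid common problems.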

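2. What Is a MySQL Character Set? Understanding the Basics

What Is a Character Set?

A character set is a set of rules for storing and processing characters as numeric data. For example, when storing the Japanese character "あ", UTF-8 represents it as the byte sequence E3 81 82, while Shift_JIS uses 82 A0.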
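
The byte sequences above are easy to check for yourself. The following is a minimal Python sketch, used here only for illustration (the helper name hex_bytes is made up for this example), that encodes the same character with both encodings.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


print(hex_bytes("あ", "utf-8"))      # E3 81 82
print(hex_bytes("あ", "shift_jis"))  # 82 A0
```

The same character produces different bytes under each encoding, which is why reading data with the wrong character set yields garbled text.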

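In MySQL, you can specify a different character set at the database, table, or column level. Choosing an appropriate character set prevents garbled text and makes internationalization smoother.

Common Character Sets

| Character Set | Characteristics | Notes |
| --- | --- | --- |
| utf8 | UTF-8 up to 3 bytes | Does not support some special characters (such as emoji) |
| utf8mb4 | UTF-8 up to 4 bytes | Supports emoji and special characters (recommended) |
| latin1 | ASCII-compatible | Used in older systems |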

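What Is a Collation?

A collation is a set of rules for comparing and sorting data within a character set. For example, it defines whether "A" and "a" are treated as the same character and how the sort order is determined.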
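
To make the idea concrete, here is a small Python sketch that contrasts a binary-style comparison with a case-insensitive one. It is only an analogy: str.casefold approximates what a _ci collation does for simple Latin text and is not MySQL's actual collation algorithm.

```python
names = ["apple", "Banana", "cherry"]

# Binary-style ordering: compares raw code points, so uppercase sorts first.
binary_order = sorted(names)

# Case-insensitive ordering: fold case before comparing, roughly like a _ci collation.
ci_order = sorted(names, key=str.casefold)

print(binary_order)  # ['Banana', 'apple', 'cherry']
print(ci_order)      # ['apple', 'Banana', 'cherry']

# Under a case-insensitive rule, "A" and "a" compare as equal.
print("A".casefold() == "a".casefold())  # True
```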

Common Collations

| Collation | Description |
| --- | --- |
| utf8_general_ci | Case-insensitive, suitable for general use |
| utf8_unicode_ci | Unicode-based collation (recommended) |
| utf8mb4_bin | Binary comparison (use when exact matches are required) |

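Differences Between utf8 and utf8mb4

MySQL's utf8 can store at most 3 bytes per character, so it cannot handle some special characters (for example, emoji and certain extended CJK characters). In contrast, utf8mb4 supports up to 4 bytes per character, which is why utf8mb4 is recommended for modern applications.

| Character Set | Max Bytes | Emoji Support | Recommendation |
| --- | --- | --- | --- |
| utf8 | 3 bytes | ❌ Not supported | ❌ Not recommended |
| utf8mb4 | 4 bytes | ✅ Supported | ✅ Recommended |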
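
As a rough illustration of the 3-byte limit, the Python sketch below flags text that contains at least one character needing 4 bytes in UTF-8, which is exactly the text a 3-byte utf8 column cannot store. The function name needs_utf8mb4 is hypothetical.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes 4 bytes in UTF-8 (outside the Basic Multilingual Plane)."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


print(needs_utf8mb4("Hello あ"))  # False: "あ" fits in 3 bytes
print(needs_utf8mb4("Nice 😀"))   # True: emoji need 4 bytes
print(needs_utf8mb4("𠮷野家"))    # True: "𠮷" is an extended CJK character
```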

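Why You Should Switch from utf8 to utf8mb4

  1. Future compatibility: Modern systems are increasingly standardizing on utf8mb4.
  2. Storing special characters and emoji: With utf8mb4, you can safely handle data from social media posts and messaging apps.
  3. Internationalization: For multilingual systems, it reduces the risk of garbled text.

Summary

  • The character set determines how data is stored and processed.
  • The collation determines how characters are compared.
  • MySQL's utf8 is limited to 3 bytes, so utf8mb4 is recommended.
  • utf8mb4_unicode_ci is a commonly recommended collation for general use.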

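3. How to Check the Current Character Set

Before changing the MySQL character set, it is important to check the current settings.
Because the character set can be set at several levels (database, table, column), you should know exactly where changes are needed.

How to Check the Current Character Set

Check the Server-Wide Character Set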

First, check the default character set and collation settings for the entire MySQL server.

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

Example output:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
+--------------------------+----------------------------+

Check the Character Set of Each Database

To check the character set of a specific database, use the following query.

SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'database_name';

Example output

+-------------+----------------------------+------------------------+
| SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+-------------+----------------------------+------------------------+
| my_database | utf8mb4                    | utf8mb4_unicode_ci     |
+-------------+----------------------------+------------------------+

Check the Character Set of a Table

Here is how to check the character set of a specific table.

SHOW CREATE TABLE table_name;

Example output

CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `email` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;

Checkpoints

  • DEFAULT CHARSET=latin1 → not utf8mb4, so it needs to be changed
  • COLLATE=latin1_swedish_ci → switching to utf8mb4_unicode_ci is usually a better fit
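
If you need to run this check across many tables, the table options can be pulled out of the SHOW CREATE TABLE text with a script. The following is a minimal Python sketch that assumes the output format shown above; the function name table_charset is hypothetical, and the DDL string is pasted in rather than fetched from a live server.

```python
import re


def table_charset(create_table_sql: str):
    """Return (charset, collation) from the table options; either may be None."""
    charset = re.search(r"DEFAULT CHARSET=(\w+)", create_table_sql)
    # Table-level collation is written "COLLATE=..."; column-level uses a space.
    collation = re.search(r"\bCOLLATE=(\w+)", create_table_sql)
    return (
        charset.group(1) if charset else None,
        collation.group(1) if collation else None,
    )


ddl = """CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;"""

charset, collation = table_charset(ddl)
print(charset, collation)
print("needs conversion:", charset != "utf8mb4")
```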

Check the Character Set of Columns

To check the character set at the column level, run the following SQL.

SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME 
FROM information_schema.COLUMNS 
WHERE TABLE_SCHEMA = 'database_name' 
AND TABLE_NAME = 'table_name';

Example output

+-------------+--------------------+----------------------+
| COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME       |
+-------------+--------------------+----------------------+
| name        | latin1             | latin1_swedish_ci    |
| email       | utf8mb4            | utf8mb4_unicode_ci   |
+-------------+--------------------+----------------------+

In this example, the name column uses latin1, so changing it to utf8mb4 is recommended.
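
The same check can be scripted once the query result is available as rows. The Python sketch below assumes each row is a (COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME) tuple as returned by a database driver; the function name columns_to_convert and the sample rows are illustrative only.

```python
def columns_to_convert(rows, target="utf8mb4"):
    """Return names of text columns whose character set differs from target.

    Non-text columns report NULL (None) for CHARACTER_SET_NAME and are skipped.
    """
    return [
        name
        for name, charset, _collation in rows
        if charset is not None and charset != target
    ]


rows = [
    ("id", None, None),
    ("name", "latin1", "latin1_swedish_ci"),
    ("email", "utf8mb4", "utf8mb4_unicode_ci"),
]
print(columns_to_convert(rows))  # ['name']
```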

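Summary

  • MySQL character sets are configured at multiple levels (server, database, table, column).
  • Checking the character set at each level tells you where changes are needed.
  • Use commands such as SHOW VARIABLES and SHOW CREATE TABLE to get a complete picture of the current configuration.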

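4. How to Change the MySQL Character Set

Changing the MySQL character set appropriately prevents garbled text and makes multilingual data easier to support.
This section explains how to update the settings at each level: server-wide, database, table, and column.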

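Change the Server-Wide Default Character Set

To change the server-wide default character set, edit the MySQL configuration file (my.cnf or my.ini).

Steps

  1. Open the configuration file
    • On Linux:
      sudo nano /etc/mysql/my.cnf
    • On Windows: open C:\ProgramData\MySQL\MySQL Server X.X\my.ini
  2. Add or change the character set settings. Under the [mysqld] section, add or update the following lines.
    [mysqld]
    character-set-server=utf8mb4
    collation-server=utf8mb4_unicode_ci
    
  3. Restart MySQL
    • On Linux:
      sudo systemctl restart mysql
    • On Windows (the service name depends on your installation, for example MySQL80):
      net stop MySQL && net start MySQL
  4. Verify the change
    SHOW VARIABLES LIKE 'character_set_server';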
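
Before restarting, it can help to confirm that the file contains the settings you intended. The following Python sketch reads a my.cnf-style fragment with the standard configparser module. It assumes a simple file like the one in step 2; real configuration files that use !include directives or repeat an option may need extra handling, and the function name server_charset_settings is hypothetical.

```python
import configparser

SAMPLE = """
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
"""


def server_charset_settings(cnf_text: str) -> dict:
    """Extract the two character set options from the [mysqld] section."""
    # allow_no_value accepts flag-style options without "="; interpolation is
    # disabled because my.cnf values may contain "%".
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    section = parser["mysqld"] if parser.has_section("mysqld") else {}
    return {
        "character-set-server": section.get("character-set-server"),
        "collation-server": section.get("collation-server"),
    }


print(server_charset_settings(SAMPLE))
```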
    

Change the Character Set at the Database Level

ALTER DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Note that this changes only the default for tables created afterwards; existing tables and columns keep their current character set until you convert them.

Verify the change

SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME 
FROM information_schema.SCHEMATA 
WHERE SCHEMA_NAME = 'mydatabase';

Change the Character Set at the Table Level

ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Verify the change

SHOW CREATE TABLE users;
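
When a database has many tables, writing each ALTER TABLE by hand is error-prone. As a sketch, the Python function below generates one conversion statement per table name. The function name convert_statements and the table list are assumptions for illustration; review the generated SQL and take a backup before running it.

```python
def convert_statements(tables, charset="utf8mb4", collation="utf8mb4_unicode_ci"):
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    statements = []
    for table in tables:
        # Backticks guard against reserved words; names containing a backtick
        # are rejected rather than escaped to keep the sketch simple.
        if "`" in table:
            raise ValueError(f"unsupported table name: {table!r}")
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        )
    return statements


for sql in convert_statements(["users", "posts"]):
    print(sql)
```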

Change the Character Set at the Column Level

ALTER TABLE users MODIFY COLUMN name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Verify the change

SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME 
FROM information_schema.COLUMNS 
WHERE TABLE_SCHEMA = 'mydatabase' 
AND TABLE_NAME = 'users';

Post-Change Verification and the Importance of Backups

To preserve data integrity when changing the character set, follow these steps.

Back up your data (before making any changes)

mysqldump -u root -p --default-character-set=utf8mb4 mydatabase > backup.sql

Re-check the settings

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
SHOW CREATE TABLE users;

Insert and display test data

INSERT INTO users (name, email) VALUES ('Test User', 'test@example.com');
SELECT * FROM users;

Summary

  • Server-wide character set change: Edit my.cnf and set character-set-server=utf8mb4
  • Database character set change: ALTER DATABASE mydatabase CHARACTER SET utf8mb4
  • Table character set change: ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4
  • Column character set change: ALTER TABLE users MODIFY COLUMN name VARCHAR(255) CHARACTER SET utf8mb4
  • After changes, always verify settings and test your data

5. Troubleshooting After Changing the Character Set

After changing the MySQL character set, you may encounter cases where the system does not behave correctly or stored data becomes garbled.
In this section, we explain common issues and how to fix them in detail.

Causes of Mojibake and How to Fix It

If mojibake occurs after changing the character set, the following causes are common.

| Cause | How to Check | Solution |
| --- | --- | --- |
| The client character set setting differs | SHOW VARIABLES LIKE 'character_set_client'; | Run SET NAMES utf8mb4; |
| Existing data was stored using a different encoding | SELECT HEX(column_name) FROM table_name; | Use CONVERT() or re-export the data |
| The connection encoding is not correct | Connect with mysql --default-character-set=utf8mb4 | Adjust the client-side character set configuration |
| Application settings (PHP/Python, etc.) are incorrect | mysqli_set_charset($conn, 'utf8mb4'); | Standardize the application's character set settings |

Fix #1: Set the client character set correctly

SET NAMES utf8mb4;

Fix #2: Convert existing data properly

UPDATE users SET name = CONVERT(CAST(CONVERT(name USING latin1) AS BINARY) USING utf8mb4);
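
The UPDATE above targets one specific kind of damage: UTF-8 bytes that were interpreted as latin1 somewhere along the way. The Python sketch below reproduces that damage and its repair outside the database, so you can see why the round trip through binary works. It uses Python's latin-1 codec as a stand-in; MySQL's latin1 is based on cp1252 and differs slightly, so treat this as an illustration rather than an exact model. The function names garble and repair are made up for this example.

```python
def garble(text: str) -> str:
    """Simulate mojibake: UTF-8 bytes misread as latin1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo it: recover the raw bytes via latin1, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


broken = garble("café")
print(broken)          # cafÃ©
print(repair(broken))  # café
```

If the stored data was not damaged in this way, the same conversion will corrupt it, so test the UPDATE on a copy of the table first.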

Notes When Converting from latin1 to utf8mb4

Safe procedure

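  1. Back up current data
    mysqldump -u root -p --default-character-set=latin1 mydatabase > backup.sql
    
  2. Change the database character set
    ALTER DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    
  3. Change the table character set
    ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    
  4. Re-import the data
    mysql -u root -p --default-character-set=utf8mb4 mydatabase < backup.sql

Note: by default, a mysqldump file also contains DROP TABLE and CREATE TABLE statements with the original DEFAULT CHARSET=latin1, so check the dump before re-importing it; otherwise the import recreates the tables as latin1. If step 3 already converted the data correctly, the re-import is not needed.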
    

Data Is Not Searchable After the Change

Case #1: LIKE search does not work

SELECT * FROM users WHERE name COLLATE utf8mb4_unicode_ci LIKE '%Tanaka%';

Case #2: Sort order changed

SELECT * FROM users ORDER BY BINARY name;

Application-Side Measures

For PHP

mysqli_set_charset($conn, 'utf8mb4');

For Python (MySQL Connector)

import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="password",
    database="mydatabase",
    charset="utf8mb4"
)

For Node.js (MySQL2)

const mysql = require('mysql2');

const connection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: 'password',
  database: 'mydatabase',
  charset: 'utf8mb4'
});

Summary

  • Post-change issues generally fall into three categories: client settings, data conversion, and application settings.
  • To prevent mojibake, standardize the client character set using SET NAMES utf8mb4.
  • Watch for LIKE search and sort order changes, and specify COLLATE when needed.
  • Set utf8mb4 in your application as well to avoid encoding mismatches.

6. How Character Set Changes Affect Performance

When changing the MySQL character set to utf8mb4, there are several performance considerations, such as increased storage usage and index limitations.
In this section, we explain the impact and the best countermeasures.

Increased Storage Usage

Compared to MySQL’s utf8, utf8mb4 can use up to 4 bytes per character,
so the overall table size may increase.

Max bytes per character by character set

| Character Set | Max Bytes per Character |
| --- | --- |
| latin1 | 1 byte |
| utf8 | 3 bytes |
| utf8mb4 | 4 bytes |

For example, with utf8, VARCHAR(255) is up to 765 bytes (255×3),
but with utf8mb4, it becomes up to 1020 bytes (255×4).
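
These figures are worst cases. UTF-8 is a variable-length encoding, so ASCII text still takes 1 byte per character under utf8mb4 and only characters such as emoji use all 4 bytes. The Python sketch below shows both the per-character sizes and the worst-case arithmetic; the helper names are hypothetical.

```python
def utf8_len(text: str) -> int:
    """Number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def varchar_max_bytes(length: int, max_bytes_per_char: int) -> int:
    """Worst-case payload of VARCHAR(length), excluding the length prefix."""
    return length * max_bytes_per_char


for ch in ["a", "é", "あ", "😀"]:
    print(ch, utf8_len(ch))  # 1, 2, 3 and 4 bytes respectively

print(varchar_max_bytes(255, 3))  # 765
print(varchar_max_bytes(255, 4))  # 1020
```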

Countermeasure

ALTER TABLE posts MODIFY COLUMN title VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Increased Index Size

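MySQL enforces a maximum index key length. With InnoDB's older COMPACT and REDUNDANT row formats the limit is 767 bytes; with the DYNAMIC row format, the default since MySQL 5.7, it is 3072 bytes.
After switching to utf8mb4, index entries become larger, and on the older formats you may hit the limit, in which case creating or converting the index fails.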

Check index impact

SHOW INDEX FROM users;

Example error

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

Countermeasure

ALTER TABLE users MODIFY COLUMN email VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
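
The number 191 comes from dividing the key length limit by the maximum bytes per character. The Python sketch below works through the arithmetic for the 767-byte limit from the error message and for the 3072-byte limit of the DYNAMIC row format; the function name max_indexable_chars is hypothetical.

```python
def max_indexable_chars(key_limit_bytes: int, max_bytes_per_char: int) -> int:
    """Largest character count whose worst-case size fits within the key limit."""
    return key_limit_bytes // max_bytes_per_char


print(max_indexable_chars(767, 3))   # 255 (3-byte utf8)
print(max_indexable_chars(767, 4))   # 191 (utf8mb4)
print(max_indexable_chars(3072, 4))  # 768 (utf8mb4, DYNAMIC row format)
```

This is why VARCHAR(255) columns that were indexable under utf8 can fail under utf8mb4 on older row formats, and why the limit is rarely a concern with the DYNAMIC format.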

Impact on Query Performance

Changing the character set to utf8mb4 may affect query execution speed.

Operations that may be affected

  • LIKE searches over large datasets
  • ORDER BY processing
  • JOIN query performance

Countermeasure

CREATE INDEX idx_name ON users(name(100));

Memory Usage and Buffer Tuning

With utf8mb4, memory usage may increase.

Recommended settings

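[mysqld]
innodb_buffer_pool_size = 1G
# MySQL 5.7 and earlier only: the query cache was removed in MySQL 8.0,
# where this option prevents the server from starting.
query_cache_size = 128M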

Summary

  • Switching to utf8mb4 increases storage usage.
  • Index sizes increase and may exceed limits.
  • Query performance can be affected.
  • Because memory usage may increase, buffer sizes may need tuning.

7. Recommended Settings (Best Practices)

By setting MySQL character sets appropriately, you can maintain data integrity while optimizing performance.
In this section, we present recommended MySQL character set configurations and explain key points for an optimal setup.

Recommended MySQL Character Set Configuration

| Item | Recommended Setting | Reason |
| --- | --- | --- |
| Character Set | utf8mb4 | Supports all Unicode characters including emoji and special characters |
| Collation | utf8mb4_unicode_ci | Case-insensitive and suitable for multilingual systems |
| Storage Engine | InnoDB | Good balance of performance and consistency |
| Indexed string length | VARCHAR(191) | Avoids exceeding MySQL index limits |

Recommended my.cnf Settings

1. MySQL Server Character Set Settings

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
init-connect='SET NAMES utf8mb4'
skip-character-set-client-handshake
innodb_file_per_table = 1
innodb_buffer_pool_size = 1G
# MySQL 5.6/5.7 only: these variables were removed in MySQL 8.0,
# so delete the next three lines on 8.0 or later.
innodb_large_prefix = ON
innodb_file_format = Barracuda
query_cache_size = 128M

2. Client-Side Character Set Settings

[client]
default-character-set = utf8mb4

Recommended Database Settings

CREATE DATABASE mydatabase DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

To change an existing database character set:

ALTER DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Recommended Table Settings

CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  email VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

To change the character set of an existing table:

ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Differences Between utf8mb4_general_ci and utf8mb4_unicode_ci

| Collation | Characteristics | Use Case |
| --- | --- | --- |
| utf8mb4_general_ci | Faster comparisons, but less accurate | Performance-focused systems |
| utf8mb4_unicode_ci | Unicode-standard, more accurate comparisons | General-purpose use (recommended) |

If you need multilingual support or accurate sorting, choose utf8mb4_unicode_ci.

Index Optimization

CREATE FULLTEXT INDEX idx_fulltext ON articles(content);

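Summary

  • The combination of utf8mb4 and utf8mb4_unicode_ci is recommended.
  • Keep the server settings (my.cnf) and the connection character set consistent.
  • Specify utf8mb4 explicitly at the database, table, and column levels.
  • Use VARCHAR(191) to avoid index key length limits.
  • Use utf8mb4_unicode_ci for accurate comparisons.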

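8. Frequently Asked Questions

Below are common practical questions about changing the MySQL character set. We also cover how to handle errors and how to choose the best settings.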

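What is the difference between utf8 and utf8mb4?

MySQL's utf8 stores at most 3 bytes per character and cannot hold emoji, while utf8mb4 stores up to 4 bytes. To see which one your server uses by default: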

SHOW VARIABLES LIKE 'character_set_server';

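Can changing the MySQL character set cause data loss?

It can if the conversion goes wrong, so always take a backup first: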

mysqldump -u root -p --default-character-set=utf8mb4 mydatabase > backup.sql

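How do I fix garbled text (mojibake)?

If the garbling comes from data that was stored using a different encoding, re-convert the affected column: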

UPDATE users SET name = CONVERT(CAST(CONVERT(name USING latin1) AS BINARY) USING utf8mb4);

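What are the risks of converting from latin1 to utf8mb4?

Existing data can become garbled and indexes can exceed the key length limit. Back up first, then change the database default: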

ALTER DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

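Does switching to utf8mb4 affect performance?

Storage usage and index sizes can grow. To keep an indexed column within the 767-byte key limit: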

ALTER TABLE users MODIFY COLUMN email VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Should I use utf8mb4_general_ci or utf8mb4_unicode_ci?

| Collation | Characteristics | Use Case |
| --- | --- | --- |
| utf8mb4_general_ci | Faster comparisons, but less accurate | Performance-focused systems |
| utf8mb4_unicode_ci | Unicode-standard, accurate comparisons | General-purpose use (recommended) |

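Will queries become slower after switching to utf8mb4?

LIKE searches, ORDER BY, and JOINs over large datasets can be affected. An appropriate index helps, for example a full-text index for text search: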

CREATE FULLTEXT INDEX idx_fulltext ON articles(content);

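Summary

  • utf8mb4 is recommended; utf8 is not, because of its limitations.
  • Always check the settings with SHOW VARIABLES before making changes.
  • Use an export/import workflow to prevent garbled text.
  • Keep index limits in mind and use VARCHAR(191) where appropriate.
  • Add appropriate indexes to improve performance.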

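Final Notes

Changing the MySQL character set is more than a simple configuration tweak. It is a critical task that can affect data integrity and performance. By following the correct settings and procedures, you can migrate to utf8mb4 safely and effectively.

🔹 Follow the steps in this article to configure your character set correctly! 🔹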