I'm setting up a new server and want to support UTF-8 fully in my web application. I have tried this in the past on existing servers and always seem to end up having to fall back to ISO-8859-1.
Where exactly do I need to set the encoding/charsets? I'm aware that I need to configure Apache, MySQL, and PHP to do this — is there some standard checklist I can follow, or perhaps troubleshoot where the mismatches occur?
This is for a new Linux server, running MySQL 5, PHP 5 and Apache 2.
Data Storage:
Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8.
Note that changing the table's character set (alter table test charset utf8mb4;) won't change the charset of the table columns. Use alter table test CONVERT TO charset utf8mb4; instead.
MySQL will implicitly use the utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).
In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use simply utf8, which only supports a subset of Unicode characters (those that fit in at most three bytes). I wish I were kidding.
Data Access:
In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application and vice versa.
Some drivers provide their own mechanism for configuring the connection character set, which both updates the driver's own internal state and informs MySQL of the encoding to be used on the connection; this is usually the preferred approach. In PHP:
If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify charset in the DSN.
If you're using mysqli, you can call set_charset().
If you're stuck with plain mysql but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset.
If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.
The same consideration regarding utf8mb4/utf8 applies as above.
Output:
Declare UTF-8 in the HTTP response header, for example Content-Type: text/html; charset=utf-8. You can achieve that either by setting default_charset in php.ini (preferred), or manually using the header() function.
When encoding output with json_encode(), you may want to add JSON_UNESCAPED_UNICODE as a second parameter to avoid the JSON Unicode escaping.
Input:
Verify that every string you receive is valid UTF-8 before storing or using it. mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
Other Code Considerations:
Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.
You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.
PHP's built-in string operations are not by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
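As a rough sketch of the input, string-handling and JSON points in this answer: the helper name requireUtf8 is made up for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper (not part of PHP): reject input that is not valid UTF-8.
function requireUtf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

// "Zoë" with a precomposed ë (U+00EB), which takes two bytes in UTF-8.
$name = requireUtf8("Zo\u{00EB}");

// Byte-oriented versus character-oriented functions.
$byteCount = strlen($name);                  // counts bytes
$charCount = mb_strlen($name, 'UTF-8');      // counts code points
$firstTwo  = mb_substr($name, 0, 2, 'UTF-8');

// Keep non-ASCII characters readable instead of \uXXXX escapes.
$json = json_encode(['name' => $name], JSON_UNESCAPED_UNICODE);
```

Calling requireUtf8() on every request parameter is tedious, which is the "religiously" part; wrapping it around whatever input layer you already have is one way to make it harder to forget.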
I'd like to add one thing to chazomaticus' excellent answer:
Don't forget the META tag either (the HTML5 form, or the HTML4 or XHTML version of it).
That seems trivial, but IE7 has given me problems with that before.
I was doing everything right; the database, database connection and Content-Type HTTP header were all set to UTF-8, and it worked fine in all other browsers, but Internet Explorer still insisted on using the "Western European" encoding.
It turned out the page was missing the META tag. Adding that solved the problem.
The W3C actually has a rather large section dedicated to I18N. They have a number of articles related to this issue, describing the HTTP, (X)HTML and CSS side of things.
They recommend using both the HTTP header and HTML meta tag (or XML declaration in case of XHTML served as XML).
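For what it's worth, here is a minimal sketch of sending both the header and the meta tag from PHP. The renderPage helper is hypothetical, and the HTML5 short form of the tag is used; the HTML4 equivalent would be a meta element with http-equiv="Content-Type" and content="text/html; charset=utf-8".

```php
<?php
// Send the charset in the HTTP header. This has to happen before any output,
// so guard against headers having already been sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper: build a minimal page that also declares UTF-8 in a meta tag.
function renderPage(string $title, string $bodyHtml): string
{
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n"
         . "<meta charset=\"utf-8\">\n"
         . "<title>{$safeTitle}</title>\n"
         . "</head>\n<body>{$bodyHtml}</body>\n</html>\n";
}

$page = renderPage("Gr\u{00FC}\u{00DF}e", "<p>Gr\u{00FC}\u{00DF}e aus K\u{00F6}ln</p>");
echo $page;
```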
In addition to setting default_charset in php.ini, you can send the correct charset using header() from within your code, before any output.
Working with Unicode in PHP is easy as long as you realize that most of the string functions don't work with Unicode, and some might mangle strings completely. PHP considers "characters" to be 1 byte long. Sometimes this is okay (for example, explode() only looks for a byte sequence and uses it as a separator, so it doesn't matter what actual characters you look for). But other times, when the function is actually designed to work on characters, PHP has no idea that your text contains multi-byte characters.
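A small illustration of the byte-versus-character point, assuming the mbstring extension is available for the comparison:

```php
<?php
// "naïve": 5 characters, but 6 bytes in UTF-8 because ï (U+00EF) takes two.
$word = "na\u{00EF}ve";

$byteLength = strlen($word);               // byte count
$charLength = mb_strlen($word, 'UTF-8');   // character count

// substr() cuts through the middle of the two-byte ï and leaves invalid UTF-8.
$broken = substr($word, 0, 3);
// mb_substr() cuts on character boundaries.
$intact = mb_substr($word, 0, 3, 'UTF-8');

// explode() only matches byte sequences, so it is safe on UTF-8 input.
$parts = explode(',', "caf\u{00E9},th\u{00E9}");
```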
A good library to look into is phputf8. This rewrites all of the "bad" functions so you can safely work on UTF-8 strings. There are extensions like mbstring that try to do this for you, too, but I prefer using the library because it's more portable (but I write mass-market products, so that's important for me). But phputf8 can use mbstring behind the scenes, anyway, to increase performance.
I found an issue with someone using PDO and the answer was to use this for the PDO connection string:
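A DSN along these lines is one way to do it. This is only a sketch: the buildMysqlDsn and connect helpers are hypothetical, the host, database and credentials are placeholders, and utf8mb4 is assumed as the charset in line with the top answer.

```php
<?php
// Hypothetical helper: build a MySQL DSN with an explicit connection charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical helper: open the connection. Not called here, since it needs a real server.
function connect(string $host, string $dbname, string $user, string $password): PDO
{
    return new PDO(
        buildMysqlDsn($host, $dbname),
        $user,
        $password,
        [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
    );
}

$dsn = buildMysqlDsn('localhost', 'testdb');
```

With mysqli, the equivalent step would be calling set_charset('utf8mb4') on the connection object right after connecting.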
In my case, I was using mb_split, which uses regular expressions. Therefore I also had to manually make sure the regular expression encoding was UTF-8 by doing mb_regex_encoding('UTF-8');
As a side note, I also discovered by running mb_internal_encoding() that the internal encoding wasn't UTF-8, and I changed that by running mb_internal_encoding("UTF-8");.
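Put together, that setup might look like the following sketch. It assumes mbstring was built with its regex support, and the sample string is invented for illustration.

```php
<?php
// Make UTF-8 the default for mbstring functions and for the mb regex functions.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// Split on the ideographic comma (U+3001), a three-byte character in UTF-8.
$cities = mb_split("\u{3001}", "東京\u{3001}大阪\u{3001}京都");
```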
First of all, if you are on a PHP version before 5.3, then no: you've got a ton of problems to tackle.
I am surprised that no one has mentioned the intl library, which has good support for Unicode, graphemes, string operations, localisation and much more; see below.
I will quote some information about Unicode support in PHP from Elizabeth Smith's slides at PHPBenelux'14:
INTL
Good:
Bad:
mb_string
ICONV
stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP')
DATABASES
Some other gotchas
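To make the grapheme point about intl a bit more concrete, here is a small sketch. It is not taken from the slides; it assumes the intl extension may or may not be installed and guards for that.

```php
<?php
// "é" written as a plain "e" followed by a combining acute accent (U+0301).
$decomposed = "e\u{0301}";

$byteCount = strlen($decomposed);   // bytes: 1 for "e" plus 2 for the accent

$graphemeCount = null;
$composed = null;
if (extension_loaded('intl')) {
    // One user-perceived character, even though it is two code points.
    $graphemeCount = grapheme_strlen($decomposed);
    // Normalization form C folds the pair into the precomposed U+00E9.
    $composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);
}
```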
The only thing I would add to these amazing answers is to emphasize saving your files in UTF-8 encoding. I have noticed that browsers honour the file's actual encoding over a UTF-8 declaration in your code. Any decent text editor will show you this. For example, Notepad++ has a menu option for file encoding, and it shows you the current encoding and enables you to change it. For all my PHP files I use UTF-8 without a BOM.
Some time ago I had someone ask me to add UTF-8 support to a PHP and MySQL application designed by someone else. I noticed that all files were encoded in ANSI, so I had to use iconv to convert all files, change the database tables to use the UTF-8 character set and the utf8_general_ci collation, add 'SET NAMES utf8' to the database abstraction layer after the connection (if using a PHP version earlier than 5.3.6; otherwise, use charset=utf8 in the connection string) and change string functions to use the equivalent PHP multibyte string functions.
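The conversion step could be sketched in PHP like this. The toUtf8 helper is hypothetical, and it assumes that "ANSI" means Windows-1252 (CP1252), which is common on Western European Windows systems but not guaranteed.

```php
<?php
// Hypothetical helper: convert text from a legacy single-byte encoding to UTF-8.
// Assumption: the source is CP1252; pass another encoding name if yours differs.
function toUtf8(string $text, string $fromEncoding = 'CP1252'): string
{
    $converted = iconv($fromEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from {$fromEncoding}");
    }
    return $converted;
}

// "café" as CP1252 bytes: the é is the single byte 0xE9.
$legacy = "caf\xE9";
$utf8 = toUtf8($legacy);
```

For whole files, the same function can be applied to the result of file_get_contents() and written back with file_put_contents(), ideally on a copy first.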
I recently discovered that using strtolower() can cause issues where the data is truncated after a special character. The solution was to use the multibyte equivalent, mb_strtolower().
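A multibyte-aware lowercase call would look something like this. It is a sketch that assumes mbstring is loaded, and the sample string is made up.

```php
<?php
// "ÉCOLE DE MONTRÉAL", with É written as an explicit code point (U+00C9).
$title = "\u{00C9}COLE DE MONTR\u{00C9}AL";

// Passing the encoding explicitly avoids depending on mb_internal_encoding().
$lower = mb_strtolower($title, 'UTF-8');
```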
In PHP, you'll need to either use the multibyte functions, or turn on mbstring.func_overload (note that this setting was deprecated in PHP 7.2 and removed in PHP 8.0). That way things like strlen will work if you have characters that take more than one byte.
You'll also need to identify the character set of your responses. You can either use Apache's AddDefaultCharset directive, or write PHP code that returns the header. (Or you can add a META tag to your HTML documents.)
I have just gone through the same issue and found a good solution in the PHP manual.
I changed all my files' encoding to UTF-8 and then set the default encoding on my connection. This solved all the problems.
If you want the MySQL server to decide the character set, and not PHP as a client (the old behaviour; preferred, in my opinion), try adding skip-character-set-client-handshake to your my.cnf, under [mysqld], and restart mysql.
This may cause trouble in case you're using anything other than UTF-8.
Unicode support in PHP is still a huge mess. While it's capable of converting an ISO 8859 string (which it uses internally) to UTF-8, it lacks the capability to work with Unicode strings natively, which means all the string processing functions will mangle and corrupt your strings.
So you have to either use a separate library for proper UTF-8 support, or rewrite all the string handling functions yourself.
The easy part is just specifying the charset in HTTP headers and in the database and such, but none of that matters if your PHP code doesn't output valid UTF-8. That's the hard part, and PHP gives you virtually no help there. (I think PHP 6 is supposed to fix the worst of this, but that's still a while away.)
The top answer is excellent. Here is what I had to do on a regular Debian, PHP, and MySQL setup:
That was all!