Using the UTF-8 character set in PHP: are the mb functions necessary?
For the past few days I've been working on converting my PHP code base from latin1 to UTF-8. I've read that the two main solutions are to either replace the single-byte functions with the built-in multibyte functions, or set the mbstring.func_overload value in the php.ini file.
But then I came across this thread on Stack Overflow, where the post by thomasrutter seems to indicate that the multibyte functions aren't actually necessary for UTF-8, as long as the script and string literals are encoded in UTF-8.
I haven't found any other evidence on whether this is true, and if it turns out I don't need to convert my code to the mb_* functions, that would be a real time saver! Is anyone able to shed some light on this?
Comments (8)
As far as I understand the issue, as long as all your data is 100% UTF-8 (that means user input, the database, and also the encoding of the PHP files themselves if they contain special characters), this is true for search and comparison operations. As @ntd points out, a non-multibyte strlen() will produce wrong results when run on a string that contains multibyte characters.
This is a great article on the basics of encoding.
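A minimal sketch of why byte-wise search and comparison hold up, assuming PHP 7+ (for the \u{...} escapes) with the mbstring extension loaded. The sample strings are my own. UTF-8 is self-synchronizing, so a valid needle cannot match in the middle of another character, but the offset that strpos() reports is in bytes.

```php
<?php
// Assumes PHP 7+ with the mbstring extension loaded; sample strings are illustrative.
$haystack = "caf\u{00E9} au lait"; // "café au lait" as UTF-8 bytes
$needle   = "au";

// Byte-wise search still finds the right match...
$bytePos = strpos($haystack, $needle);
// ...but its offset counts bytes, while mb_strpos() counts characters.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');

// Comparison and replacement operate on bytes, which is fine when both sides are UTF-8.
$same     = ("caf\u{00E9}" === "caf\xC3\xA9");
$replaced = str_replace("\u{00E9}", "e", $haystack);

var_dump($bytePos, $charPos, $same, $replaced);
```

The two positions differ by one because "é" occupies two bytes. This matters as soon as the offset is passed on to something that counts characters.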
They aren't "necessary" unless you're using any of the functions they replace (and it's likely that you are using at least one of them) or you explicitly need a feature of the extension, such as HTTP handling.
When working towards UTF-8 compliance, I always fall back to the PHP UTF-8 Cheatsheet, with one addition: PCRE patterns need to be updated to use the u modifier.
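To illustrate the point about the u modifier, here is a small sketch assuming PHP 7+ (the test character is my own choice). Without u, PCRE treats the subject as a sequence of bytes. With u, both pattern and subject are treated as UTF-8.

```php
<?php
// Assumes PHP 7+ for the \u{...} escape.
$char = "\u{00E9}"; // "é": one character, two bytes in UTF-8

$withoutU = preg_match('/^.$/', $char);  // "." matches a single byte
$withU    = preg_match('/^.$/u', $char); // "." matches a whole UTF-8 character

var_dump($withoutU, $withU);
```

One side effect worth knowing: with u set, preg_match() returns false rather than 0 when the subject is not valid UTF-8, so checking the result with === is safer than a loose truthiness test.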
As soon as you're examining or modifying a multibyte string, you need to use an mb_* function. A very quick example which demonstrates why:
This prints out:
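The original snippet and its output are missing here. As a stand-in, here is a minimal sketch of the kind of mismatch being described, assuming PHP 7+ with the mbstring extension loaded (the sample word is my own choice, not the poster's):

```php
<?php
// Assumes PHP 7+ with the mbstring extension loaded; not the poster's original example.
$word = "na\u{00EF}ve"; // "naïve": 5 characters, 6 bytes in UTF-8

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

$byteCut = substr($word, 0, 3);             // stops after 3 bytes, splitting "ï"
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // stops after 3 characters

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // is the byte-wise cut still valid UTF-8?
var_dump($charCut === "na\u{00EF}");
```

The byte-wise cut ends in the middle of a two-byte sequence, so the result is no longer valid UTF-8. That is the typical way examining or modifying a multibyte string goes wrong.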
thomasrutter indicates that searching does not need special handling. But if, for example, you need to check the length of a UTF-8 string, I don't see how you can do that using plain strlen().
Functions such as mb_strtoupper may be necessary, too. strtoupper won't convert á to Á.
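A short sketch of the case-conversion difference, assuming PHP 7+ with mbstring loaded and no unusual locale configured (the sample word is illustrative):

```php
<?php
// Assumes PHP 7+ with the mbstring extension loaded.
$word = "\u{00E1}rbol"; // "árbol"

$byteUpper = strtoupper($word);             // works byte by byte
$mbUpper   = mb_strtoupper($word, 'UTF-8'); // knows the UTF-8 case mappings

var_dump($byteUpper === $mbUpper);     // do the two results agree?
var_dump($mbUpper === "\u{00C1}RBOL"); // is the mb result "ÁRBOL"?
```

strtoupper() upper-cases the ASCII letters but does not turn the two bytes of "á" into "Á", so the two results differ.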
There are a number of functions that expect strings to be single-byte (and some even presume ISO-8859-1). In these cases, you need to be aware of what you're doing and possibly use replacement functions. There is a fairly comprehensive list at: http://www.phpwact.org/php/i18n/utf-8
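As one example of such a function, strrev() reverses bytes and has no mb_* counterpart. Below is a sketch of a replacement, assuming PHP 7+ with mbstring loaded. The helper name utf8_strrev is hypothetical and not taken from the linked list or from any library.

```php
<?php
// Hypothetical helper (not part of PHP or any library): reverse a UTF-8 string by character.
function utf8_strrev(string $s): string
{
    // With the u modifier, the empty pattern splits between characters, not bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return implode('', array_reverse($chars));
}

$word = "a\u{00F1}o"; // "año"

$broken = strrev($word);      // reverses bytes, so the two bytes of "ñ" swap places
$fixed  = utf8_strrev($word); // reverses characters

var_dump(mb_check_encoding($broken, 'UTF-8')); // is the strrev() result valid UTF-8?
var_dump($fixed === "o\u{00F1}a");
```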
You could use the mbfunctions library that extends the multibyte functions in PHP:
http://code.google.com/p/mbfunctions/
You can use this setting in the php.ini file, so you don't need to change your code:
http://php.net/manual/en/mbstring.overload.php
But be careful, because not all string functions are overloaded automatically. This is one that is not:
http://php.net/manual/en/function.substr-replace.php
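One caveat for anyone reading this later: mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so code that relies on it will not carry over to current PHP versions. Here is a small sketch of checking whether overloading is active instead of assuming it, with PHP 7+ and mbstring assumed (the variable names are my own):

```php
<?php
// ini_get() returns false when a directive does not exist, as on PHP 8+.
$overload = ini_get('mbstring.func_overload');

// Bit value 2 is the one that covers the string functions (strlen, substr, ...).
$stringFunctionsOverloaded = $overload !== false && (((int) $overload) & 2) === 2;

// Calling the mb_* function explicitly behaves the same either way.
$length = mb_strlen("\u{00FC}ber", 'UTF-8'); // "über"

var_dump($stringFunctionsOverloaded, $length);
```

Calling the mb_* functions directly avoids depending on server configuration and keeps the code portable across PHP versions.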