Should I refactor my whole framework to use mbstring functions?
I currently use mbstring.func_overload = 7 to work with the UTF-8 charset.
I am thinking of refactoring every function call to use the mb_* functions.
Do you think this is necessary, or will the multibyte problem be solved in another way in PHP 6 or a newer version?
Comments (2)
This is not recommended if you use libraries written by other people. Here are three reasons.
A good example of reason 1 is the miscalculation of the byte size for the HTTP Content-Length field when strlen is used. The cause is that the overloaded strlen function returns the number of characters, not the number of bytes. You can see real-world issues in CakePHP and Zend_Http_Client.
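To make the Content-Length problem concrete, here is a small sketch. The sample string and variable names are made up for illustration. It assumes the mbstring extension is loaded and that mbstring.func_overload is off, so strlen() still counts bytes. Passing '8bit' to mb_strlen() is one way to ask for a byte count explicitly.

```php
<?php
// Hypothetical response body: three Japanese characters, written with
// code point escapes so the example does not depend on the file encoding.
$body = "\u{65E5}\u{672C}\u{8A9E}";

// Without func_overload, strlen() counts bytes.
// Each of these characters takes 3 bytes in UTF-8.
$bytes = strlen($body);

// mb_strlen() with an explicit encoding counts characters.
$chars = mb_strlen($body, 'UTF-8');

// The '8bit' encoding makes mb_strlen() count bytes instead.
$bytesExplicit = mb_strlen($body, '8bit');

echo "Content-Length should be {$bytesExplicit}, not {$chars}\n";
```

With func_overload enabled, the first call would have returned the character count, which is the wrong value for a header that is defined in bytes.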
Edit:
Deprecating mbstring.func_overload is under consideration for PHP 5.5 or 5.6 (according to a mail from the mbstring maintainer in April 2012), so you should avoid mbstring.func_overload from now on.
The recommended policy for handling multibyte characters across platforms is to use mbstring, intl, or iconv directly. If you really need fallback functions for handling multibyte characters, use function_exists().
You can see examples in WordPress and MediaWiki.
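The function_exists() approach could look roughly like the sketch below. This is not code from WordPress or MediaWiki. The helper name fallback_utf8_strlen is hypothetical, and the fallback assumes its input is valid UTF-8.

```php
<?php
// Hypothetical pure-PHP fallback: counts UTF-8 characters by subtracting
// the continuation bytes (10xxxxxx, i.e. 0x80-0xBF) from the byte count.
// Assumes the input is valid UTF-8.
function fallback_utf8_strlen(string $s): int
{
    return strlen($s) - (int) preg_match_all('/[\x80-\xBF]/', $s);
}

// Define mb_strlen() only when the mbstring extension is not loaded,
// so callers can use the mb_* name everywhere.
if (!function_exists('mb_strlen')) {
    function mb_strlen(string $s, ?string $encoding = null): int
    {
        // This sketch ignores $encoding and always treats input as UTF-8.
        return fallback_utf8_strlen($s);
    }
}

echo mb_strlen("caf\u{E9}", 'UTF-8'), "\n";
```

The calling code keeps using the mb_* name directly, and the fallback only comes into play on installations without the extension.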
Some CMSes, such as Drupal (unicode.inc), introduce a multibyte abstraction layer.
I think an abstraction layer is not a good idea.
The reason is that in many cases fewer than ten multibyte functions are needed, and those functions are easy to use directly. A layer that switches the handling to mbstring, intl, or iconv depending on which of these modules is installed only decreases performance.
The minimum requirement for handling multibyte characters is mb_substr() plus handling of invalid byte sequences.
You can see fallback functions for mb_substr() in the CMSes above.
I answered a question about handling invalid byte sequences here: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems
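As a rough illustration of the invalid-byte-sequence part, one possible approach is to validate with mb_check_encoding() and re-encode only when needed. This is a sketch and not the code from the linked answer. The helper name replace_invalid_utf8 is hypothetical, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: returns valid UTF-8, substituting "?" for
// bytes that cannot be decoded.
function replace_invalid_utf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid, nothing to do
    }
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character(0x3F);          // 0x3F is the code point of "?"
    // Converting from UTF-8 to UTF-8 replaces undecodable bytes
    // with the substitute character.
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the previous setting
    return $clean;
}

// "\xFF" can never appear in valid UTF-8.
echo replace_invalid_utf8("a\xFFb"), "\n";
```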
Yes, of course. There are many things you can do with strings, though. UTF-8 is backwards compatible with ASCII. If you only want to operate on the ASCII characters of a string, it may or may not make a difference. It depends on what you need to do with your strings.
If you want a direct answer: No, you should not refactor every function to an mb_ function, because that is likely overkill. Should you check your use cases to see whether a multibyte UTF-8 string may affect the results, and refactor accordingly? Yes.
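A short sketch of that distinction, with a made-up sample string and assuming mbstring is loaded and func_overload is off: splitting on an ASCII delimiter works on UTF-8 bytes as-is, while counting or cutting by position is where the byte functions and the mb_ functions diverge.

```php
<?php
// Hypothetical sample: "naive,cafe" with accented characters, written
// with code point escapes so the file encoding does not matter.
$s = "na\u{EF}ve,caf\u{E9}";

// Splitting on an ASCII delimiter is byte-safe in UTF-8: the byte 0x2C
// never occurs inside a multibyte sequence.
$parts = explode(',', $s);

// Lengths differ: the accented "e" takes 2 bytes but is 1 character.
$byteLen = strlen($parts[1]);
$charLen = mb_strlen($parts[1], 'UTF-8');

// Cutting by byte offset can end in the middle of a character.
$cutBytes = substr($parts[1], 0, 4);
// Cutting by character offset keeps the last character whole.
$cutChars = mb_substr($parts[1], 0, 4, 'UTF-8');

var_dump(mb_check_encoding($cutBytes, 'UTF-8')); // the byte cut is not valid UTF-8
var_dump(mb_check_encoding($cutChars, 'UTF-8')); // the character cut is
```

Calls like the explode() above would not need refactoring. The strlen() and substr() calls are the kind worth checking.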