PHP multibyte string functions
Today I ran into a problem with the PHP function strpos(): it returned FALSE even though the correct result was obviously 0. This was because one parameter was encoded in UTF-8, but the other (which originates from an HTTP GET parameter) obviously was not.
Now I have noticed that using the mb_strpos function solved my problem.
My question now is: is it wise to use the PHP multibyte string functions generally to avoid these problems in the future? Should I avoid the traditional strpos, strlen, ereg, etc. functions altogether?
Note: I don't want to set mbstring.func_overload globally in php.ini, because this leads to other problems when using the PEAR library. I am using PHP4.
Comments (5)
It depends on the character encoding you are using. In single-byte character encodings, or in UTF-8 (where a single byte inside a character can never be mistaken for another character), you can continue to use the regular string search functions as long as the string you are searching in and the string you are searching for are in the same encoding.
If you are using a multi-byte encoding other than UTF-8, which does not prevent single bytes within a character from appearing like other characters, then it is never safe to do a string search using the regular string search functions. You may find false positives. This is because PHP's string comparison in functions such as strpos is per-byte, and with the exception of UTF-8 which is specifically designed to prevent this problem, multi-byte encodings suffer the problem that any subsequent byte in a character made up of more than one byte may match part of a different character.
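To make the false-positive case concrete, here is a minimal sketch. It assumes Shift_JIS as the "multi-byte encoding other than UTF-8" (my choice for illustration, not something named above) and assumes the mbstring extension is loaded.

```php
<?php
// Assumption: Shift_JIS stands in for a non-UTF-8 multi-byte encoding.
// In Shift_JIS the katakana letter "ア" is the byte pair 0x83 0x41,
// and 0x41 on its own is the ASCII letter "A".
$sjisHaystack = "\x83\x41";

// Byte-wise search: matches the trailing byte of "ア", a false positive.
$bytewise = strpos($sjisHaystack, 'A');

// Encoding-aware search: the string contains no "A" character.
$charwise = mb_strpos($sjisHaystack, 'A', 0, 'SJIS');

// The same letter in UTF-8 is 0xE3 0x82 0xA2. Every byte of a multi-byte
// UTF-8 sequence is 0x80 or higher, so it cannot collide with ASCII.
$utf8Haystack = "\xE3\x82\xA2";
$utf8Bytewise = strpos($utf8Haystack, 'A');

var_dump($bytewise, $charwise, $utf8Bytewise);
```

The byte-wise search reports a match at offset 1 in the Shift_JIS string, while the encoding-aware search and the byte-wise search on the UTF-8 string both report no match.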
If the string you are searching in and the string you are searching for are of different character encodings, then conversion will always be necessary. Otherwise you'll find that for any string that would be represented differently in the other encoding, it will always return false. You should do such conversion on input: decide on a character encoding your app will use, and be consistent within the application. Any time you receive input in a different encoding, convert on the way in.
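A rough sketch of "convert on the way in", reproducing the situation from the question. The helper name to_app_encoding and the guess that the GET parameter arrives as ISO-8859-1 are assumptions for illustration; the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: normalise incoming text to the application's encoding (UTF-8 here).
// $assumedSource is a guess about what the client sent; PHP cannot detect it reliably.
function to_app_encoding($input, $assumedSource = 'ISO-8859-1')
{
    // Input that is already valid UTF-8 is left untouched.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $assumedSource);
}

$haystack  = "\xC3\xA9t\xC3\xA9"; // "été" encoded as UTF-8
$rawNeedle = "\xE9t";             // "ét" as ISO-8859-1 bytes, e.g. taken from a GET parameter

// Mismatched encodings: the bytes never line up, so nothing is found.
$before = strpos($haystack, $rawNeedle);

// After converting the input, the plain byte-wise search works.
$after = strpos($haystack, to_app_encoding($rawNeedle));

var_dump($before, $after);
```

The validity check is only a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so knowing the real source encoding is better than guessing.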
There have been some problems with the mb_* functions in PHP versions prior to 5.2, so if your code runs on multiple platforms with different versions of PHP, strange behavior can occur. Furthermore, the mb_strpos function is rather slow: it has to skip the number of characters specified by the offset parameter to get the real byte position used internally. In loops that depend on strpos/mb_strpos, this can become a major bottleneck.
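One possible way around that bottleneck, sketched under the assumption that both strings are valid UTF-8: do the searching with byte-wise strpos (safe in UTF-8) and count characters only over the stretch between consecutive matches. The function name utf8_find_all is hypothetical.

```php
<?php
// Hypothetical helper: character offsets of every occurrence of $needle
// in $haystack. Both arguments are assumed to be valid UTF-8.
function utf8_find_all($haystack, $needle)
{
    $offsets = array();
    if ($needle === '') {
        return $offsets;
    }
    $searchFrom = 0; // byte offset where the next strpos() call starts
    $lastByte   = 0; // byte offset of the previous match (or string start)
    $charPos    = 0; // character offset that corresponds to $lastByte

    while (($found = strpos($haystack, $needle, $searchFrom)) !== false) {
        // Count only the characters between the previous match and this one,
        // instead of skipping from the start of the string every time.
        $charPos += mb_strlen(substr($haystack, $lastByte, $found - $lastByte), 'UTF-8');
        $offsets[]  = $charPos;
        $lastByte   = $found;
        $searchFrom = $found + strlen($needle);
    }
    return $offsets;
}

// "äbäb": the letter "b" sits at character offsets 1 and 3.
var_dump(utf8_find_all("\xC3\xA4b\xC3\xA4b", 'b'));
```

This relies on the UTF-8 property that a match for a valid needle always starts on a character boundary; it would not be safe for other multi-byte encodings.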
If you use the same encoding everywhere it generally isn't a problem. I use UTF-8 for all my pages, and have never actually encountered this problem. In the end it really comes down to specifying the same encoding for the pages and the database.
For example:
In most cases this means that all the data sources for the application will deliver data in the same encoding, and thus you'll avoid this kind of problem.
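As a rough illustration of declaring one encoding consistently on the page side, something along these lines could be used. The constant APP_CHARSET, the helper render_search_form and the /search.php path are all made up for this sketch.

```php
<?php
// Assumption: UTF-8 is the application-wide encoding.
define('APP_CHARSET', 'UTF-8');

// Tell the browser how the response body is encoded
// (skipped if output has already started).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=' . APP_CHARSET);
}

// Hypothetical helper: a form that asks the browser to submit in the same encoding.
function render_search_form($action)
{
    return '<form method="get" action="'
         . htmlspecialchars($action, ENT_QUOTES, APP_CHARSET)
         . '" accept-charset="' . APP_CHARSET . '">'
         . '<input type="text" name="q"></form>';
}

echo render_search_form('/search.php');
```

The database connection needs to be set to the same character set as well; how to do that depends on the database and driver, so it is left out here.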
This will all be much better with the advent of PHP 6, btw, since it will include full Unicode support.
You don't necessarily have to use mb_strpos, but you do need to make sure that all the data in your app is the same: either an mb_string, or a plain string in one particular encoding. (Usually UTF-8.)
If you make sure your pages are UTF-8, and your form submissions are interpreted as UTF-8, and your database stores UTF-8, you'll generally be OK. Indexed string operations (in particular truncations) can break a UTF-8 sequence, which is annoying but not generally disastrous. If you do need that level of support, mb_strings are your only option (but of course you have to make sure that all parts of your app and libraries and PHP version can cope with them properly).
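A small sketch of the truncation problem, assuming UTF-8 data and the mbstring extension: cutting by bytes can split a character, cutting by characters cannot.

```php
<?php
$text = "h\xC3\xA9llo"; // "héllo" in UTF-8: 6 bytes, 5 characters

// Byte-based truncation cuts the two-byte "é" in half.
$byteCut = substr($text, 0, 2);

// Character-based truncation keeps whole characters.
$charCut = mb_substr($text, 0, 2, 'UTF-8');

$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutValid = mb_check_encoding($charCut, 'UTF-8');

var_dump($byteCutValid, $charCutValid, $charCut === "h\xC3\xA9");
```

The byte-cut result ends in a dangling lead byte, which browsers typically render as a replacement character.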
Developing sites that handle Unicode correctly in PHP isn't too much fun right now: its Unicode support is very poor compared to languages like Python and .NET. It is hoped PHP6 will improve matters.
I would recommend using the following PHP UTF-8 library:
http://sourceforge.net/projects/phputf8
Bundling it with your application loosens your application's requirements by not requiring the mbstring extension, but you still get UTF-8 string functions.