strpos searching for Unicode in PHP (and handling inline UTF-8)
I am having a problem with a simple search for a two-character Unicode string (the needle) inside another string (the haystack) that may or may not be UTF-8.

Part of the problem is that I don't know how to specify the code for use in strpos, and I don't know whether PHP has to be compiled with any special support for it, or whether I have to use mb_strpos, which I am trying to avoid since it might not be available either.

For example, the needle is U+56DE U+590D (without the space).

With preg_match it might be preg_match("@\x{56DE}\x{590D}@", $haystack), but that actually requires the u modifier (@u), which might not be available, and I get "Compilation failed: character value in \x{...} sequence is too large" anyway.

I don't want to use preg_match anyway, as it might be significantly slower than strpos (there are other sequences that have to be searched).

Can I convert U+56DE U+590D into a sequence of single bytes (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes.

How do you specify Unicode inline in PHP anyway, I mean outside of PCRE? $blah="\u56DE\u590D"; doesn't work.

Thanks for any ideas!
Comments (2)
First, your question is poorly structured. It asks several questions at several points. You would probably get more answers with a clearer structure: 1) describe the task you're trying to accomplish, 2) the limitations/requirements, 3) the strategy you considered, 4) the difficulties you found with that strategy, or whether there is a better one.
That said, I'll start from the end:

No. The language doesn't know anything about Unicode. In PHP, strings are byte arrays. Therefore, how you express Unicode code points in a PHP script depends on the encoding you want to use. For UTF-8, it would be "\xE5\x9B\x9E\xE5\xA4\x8D", for UTF-16 big endian it would be "\x56\xDE\x59\x0D", and so on.

For the first part, i.e. converting U+56DE U+590D into bytes: yes, but a clarification is needed. Are these UTF-16 code units or Unicode code points? For instance, how is 𪛖 represented: as U+D869 U+DED6 or as U+2A6D6? If they are UTF-16 code units, it's trivial to encode them into UTF-16. For UTF-16 big endian, it's just "\x56\xDE\x59\x0D". Otherwise, it's still trivial to encode them in UTF-32, but it takes a little more work to do the same in UTF-16 (or UTF-8).

For the second part, keep reading.
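The "little more work" for UTF-8 can be sketched as follows. This is a minimal sketch, assuming the values are Unicode code points and the haystack is UTF-8; codepoint_to_utf8 is a hypothetical helper name, not a PHP built-in, and it does not reject surrogates or values above U+10FFFF.

```php
<?php
// Hypothetical helper: encode one Unicode code point (an integer) as UTF-8 bytes.
// Assumes $cp is a valid code point; no validation is done.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Build the needle U+56DE U+590D as raw UTF-8 bytes.
$needle = codepoint_to_utf8(0x56DE) . codepoint_to_utf8(0x590D);

// Example haystack, assumed to be UTF-8: "abc", the needle, then "xyz".
$haystack = "abc\xE5\x9B\x9E\xE5\xA4\x8Dxyz";

// Plain byte-wise search; the result is a byte offset (3 here), not a character index.
$pos = strpos($haystack, $needle);
```

If the needles are fixed, the bytes can simply be written as a literal such as "\xE5\x9B\x9E\xE5\xA4\x8D" and no helper is needed. From PHP 7.0 on, double-quoted strings also accept the "\u{56DE}\u{590D}" escape, which expands to the same UTF-8 bytes.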
What are you trying to do? Why do you need to find a position in a string? strpos will give you a byte offset for a given string (again, interpreted in binary form). Are you trying to clip a string? strpos (or even mb_strpos) means trouble in Unicode: a glyph can consist of several code units, so you risk clipping part of a glyph. I can't advise you further unless you say what you're trying to do.
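The byte-offset point can be illustrated with a short sketch. The sample string below is an assumption chosen for the example (the question's two characters followed by "!"), written as UTF-8 bytes.

```php
<?php
// "回复!" written as UTF-8 bytes: two 3-byte characters followed by one ASCII byte.
$text = "\xE5\x9B\x9E\xE5\xA4\x8D!";

// strpos counts bytes: "!" is the third character but sits at byte offset 6.
$pos = strpos($text, "!");

// Clipping by a byte count can cut through a character: the first 4 bytes
// are the first character plus a dangling lead byte of the second.
$clipped = substr($text, 0, 4);
```

Here $pos is 6 and $clipped is "\xE5\x9B\x9E\xE5", which is no longer valid UTF-8. Cutting at an offset returned by strpos for a valid UTF-8 needle in a valid UTF-8 haystack is safe at the code point level, because UTF-8 is self-synchronizing, but it can still separate a base character from a following combining mark.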
You wrote 'might not be available'. I suggest you try mb_strpos.
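One way to act on this while still coping with a missing mbstring extension is to test for the function at runtime. The following is only a sketch: utf8_strpos_fallback is a hypothetical name, and both strings are assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: character offset of $needle in $haystack, or false if absent.
// Assumes both strings are valid UTF-8.
function utf8_strpos_fallback($haystack, $needle)
{
    if (function_exists('mb_strpos')) {
        // mbstring is available: let it compute the character offset.
        return mb_strpos($haystack, $needle, 0, 'UTF-8');
    }

    // Fallback: byte-wise search, then convert the byte offset to a character
    // offset by counting the bytes that start a character, i.e. every byte
    // that is not a continuation byte (10xxxxxx).
    $bytePos = strpos($haystack, $needle);
    if ($bytePos === false) {
        return false;
    }
    $chars = 0;
    for ($i = 0; $i < $bytePos; $i++) {
        if ((ord($haystack[$i]) & 0xC0) !== 0x80) {
            $chars++;
        }
    }
    return $chars;
}

$haystack = "\xE4\xBD\xA0\xE5\xA5\xBD\xE5\x9B\x9E\xE5\xA4\x8D"; // "你好回复" in UTF-8
$needle   = "\xE5\x9B\x9E\xE5\xA4\x8D";                         // "回复" in UTF-8

// Both branches give the character offset 2 (the byte offset would be 6).
$pos = utf8_strpos_fallback($haystack, $needle);
```

If only a yes/no answer is needed, strpos($haystack, $needle) !== false is enough and mbstring is not required at all.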