strpos searching for Unicode in PHP (and handling inline UTF-8)

Posted 2024-09-15 17:30:56 · 774 words · 0 views · 0 comments

I am having a problem with a simple search for a two-character Unicode string (the needle) inside another string (the haystack) that may or may not be UTF-8.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos which I am trying to avoid since it also might not be available.

For example, the needle is U+56DE U+590D (without the space).

With preg_match it might be preg_match("@\x{56DE}\x{590D}@", $haystack), but that actually requires the u modifier (@u), which might not be available, and I get "Compilation failed: character value in \x{...} sequence is too large" anyway.

I don't want to use preg_match anyway as it might be significantly slower than strpos (there are other sequences that have to be searched).
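For reference, a minimal sketch of what the u-modifier version of that pattern would look like. The haystack is a made-up example and is assumed to be valid UTF-8; the sketch also assumes PCRE was built with UTF-8 support, which is exactly what may be missing here.

```php
<?php
// Hypothetical haystack: "Re: " followed by U+56DE U+590D encoded as UTF-8.
$haystack = "Re: \xE5\x9B\x9E\xE5\xA4\x8D";

// Single quotes leave \x{...} for PCRE to interpret. The trailing "u"
// modifier makes PCRE treat both pattern and subject as UTF-8; without it,
// \x{56DE} is rejected as too large.
$found = @preg_match('@\x{56DE}\x{590D}@u', $haystack);

// preg_match returns 1 on a match, 0 on no match, and false on error
// (for example, no UTF-8 support in PCRE, or a subject that is not valid UTF-8).
var_dump($found);
```

Because a haystack that is not valid UTF-8 produces false rather than 0, the result has to be compared with === instead of being used as a plain boolean.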

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

How do you specify Unicode inline in PHP anyway? I mean outside of PCRE?

$blah="\u56DE\u590D"; doesn't work?

Thanks for any ideas!

Comments (2)

表情可笑 2024-09-22 17:30:56

First, your question is poorly structured. It asks several questions at several points. You would probably get more answers if you used a clearer structure: 1) describe the task you're trying to accomplish, 2) the limitations/requirements, 3) the strategy you considered, 4) the difficulties you found with that strategy, and whether there is a better one.

That said, I'll start from the end:

$blah="\u56DE\u590D"; doesn't work?

No. The language doesn't know anything about Unicode. In PHP, strings are byte arrays. Therefore, how you express Unicode code points in a PHP script depends on the encoding you want to use. For UTF-8, it would be "\xE5\x9B\x9E\xE5\xA4\x8D"; for UTF-16 big-endian, it would be "\x56\xDE\x59\x0D"; and so on.
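As a minimal sketch of the point above: the needle can be written as raw UTF-8 bytes with \x escapes and passed to plain strpos, which compares bytes and needs neither mbstring nor PCRE Unicode support. The haystack is a made-up example and is assumed to be UTF-8. The last line relies on the "\u{...}" escape, which only exists in PHP 7.0 and later; the brace-less "\u56DE" form is not an escape in any version.

```php
<?php
// Needle U+56DE U+590D written as raw UTF-8 bytes.
$needle = "\xE5\x9B\x9E\xE5\xA4\x8D";

// Hypothetical haystack, assumed to be UTF-8.
$haystack = "Re: " . $needle . " hello";

// strpos works on bytes, so the result is a byte offset (or false).
$pos = strpos($haystack, $needle);
var_dump($pos); // int(4)

// PHP 7.0+ only: "\u{...}" expands to the UTF-8 bytes of the code point.
var_dump("\u{56DE}\u{590D}" === $needle); // bool(true) on PHP 7.0+
```

If the haystack is in a different encoding (for example UTF-16), the needle bytes have to be written in that encoding instead, or the haystack converted first.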

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

For the first part, i.e., converting U+56DE U+590D into bytes: yes, but a clarification is needed. Are these UTF-16 code units or Unicode code points? For instance, how is the character 𪛖 represented: as U+D869 U+DED6 or as U+2A6D6? If they are UTF-16 code units, it's trivial to encode them into UTF-16. For UTF-16 big-endian, it's just "\x56\xDE\x59\x0D". Otherwise, it's still trivial to encode them in UTF-32, but it takes a little more work to do the same in UTF-16 (or UTF-8).
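The "little more work" for UTF-8 can be sketched as below. Both functions are hypothetical helpers written for this illustration, not PHP built-ins, and they do no validation (for instance, they do not reject lone surrogates or values above U+10FFFF).

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogates_to_codepoint($high, $low)
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

$needle = codepoint_to_utf8(0x56DE) . codepoint_to_utf8(0x590D);
echo bin2hex($needle), "\n";   // e59b9ee5a48d

$cp = surrogates_to_codepoint(0xD869, 0xDED6);
printf("U+%X\n", $cp);         // U+2A6D6
```

The resulting $needle is the same six-byte string as the "\xE5\x9B\x9E\xE5\xA4\x8D" literal, so it can be handed to strpos directly.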

For the second part, keep reading.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos which I am trying to avoid since it also might not be available.

What are you trying to do? Why do you need to find a position in a string? strpos will give you a byte offset for a given string (again, interpreted in binary form). Are you trying to clip a string? strpos (or even mb_strpos) means trouble with Unicode: a glyph can be made up of several code units (or even several code points), so you risk clipping part of a glyph. I can't advise you further unless you say what you're trying to do.
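To illustrate the clipping risk, here is a small sketch, assuming the haystack is valid UTF-8. utf8_clip is a hypothetical helper written for this example: it backs off to the nearest code point boundary, which keeps multi-byte characters whole but still does not protect glyphs built from several code points (such as a base letter plus a combining mark).

```php
<?php
// Hypothetical helper: cut a UTF-8 string to at most $maxBytes bytes
// without splitting a multi-byte character.
function utf8_clip($s, $maxBytes)
{
    if (strlen($s) <= $maxBytes) {
        return $s;
    }
    $end = $maxBytes;
    // Bytes of the form 10xxxxxx are continuation bytes; if the cut would
    // land on one, move back to the lead byte of that character.
    while ($end > 0 && (ord($s[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($s, 0, $end);
}

$needle   = "\xE5\x9B\x9E\xE5\xA4\x8D";            // U+56DE U+590D
$haystack = "\xE4\xBD\xA0\xE5\xA5\xBD" . $needle;  // U+4F60 U+597D, then the needle

// The needle starts at the third character but at byte offset 6.
$bytePos = strpos($haystack, $needle);
var_dump($bytePos); // int(6)

// A naive substr($haystack, 0, 4) would end in the middle of U+597D;
// utf8_clip backs off to the 3 bytes of U+4F60.
$clipped = utf8_clip($haystack, 4);
echo bin2hex($clipped), "\n"; // e4bda0
```

Cutting exactly at an offset returned by strpos is safe in valid UTF-8 when the needle itself is valid UTF-8, because such a match always starts on a character boundary.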

往事随风而去 2024-09-22 17:30:56

You wrote 'might not be available'. I suggest you try mb_strpos.
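One way to act on this without depending on the extension is to test for it at runtime. The sketch below uses a hypothetical wrapper, utf8_strpos, which is not a PHP built-in: it calls mb_strpos when the mbstring extension is loaded and otherwise falls back to strpos, converting the byte offset to a character offset by hand. It assumes both strings are valid UTF-8.

```php
<?php
// Hypothetical wrapper: character offset of $needle in $haystack, or false.
function utf8_strpos($haystack, $needle)
{
    if (function_exists('mb_strpos')) {
        // mbstring is available: let it count characters.
        return mb_strpos($haystack, $needle, 0, 'UTF-8');
    }

    // Fallback: byte search, then count the characters before the match.
    $bytePos = strpos($haystack, $needle);
    if ($bytePos === false) {
        return false;
    }
    $chars = 0;
    for ($i = 0; $i < $bytePos; $i++) {
        // Every byte that is not a continuation byte (10xxxxxx) starts a character.
        if ((ord($haystack[$i]) & 0xC0) !== 0x80) {
            $chars++;
        }
    }
    return $chars;
}

$needle   = "\xE5\x9B\x9E\xE5\xA4\x8D";            // U+56DE U+590D
$haystack = "\xE4\xBD\xA0\xE5\xA5\xBD" . $needle;  // two characters, then the needle

var_dump(utf8_strpos($haystack, $needle)); // int(2)
```

If only a yes/no answer is needed, plain strpos($haystack, $needle) !== false gives the same result on either path and skips the character counting entirely.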
