strpos searching for Unicode in PHP (and handling UTF-8 inline)

Posted on 2024-09-15 17:30:56

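I am having a problem with a simple search for a two-character Unicode string (the needle) inside another string (the haystack) that may or may not be UTF-8.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos, which I am trying to avoid since it also might not be available.

For example, the needle is U+56DE U+590D (without the space).

With preg_match it might be preg_match("@\x{56DE}\x{590D}@",$haystack), but that actually requires @u, which might not be available, and I get "Compilation failed: character value in \x{...} sequence is too large" anyway.

I don't want to use preg_match anyway, because it might be much slower than strpos (there are other sequences that have to be searched).

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

How do I specify Unicode inline in PHP? I mean outside of PCRE?

$blah="\u56DE\u590D"; doesn't work?

Thanks for any ideas!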

表情可笑 2024-09-22 17:30:56

First, your question is poorly structured. It asks several questions at several points. You would probably get more answers if you used a clearer structure: 1) describe the task you're trying to accomplish, 2) the limitations/requirements, 3) the strategy you considered, 4) the difficulties you found with that strategy, or whether there is a better one.

That said, I'll start from the end:

$blah="\u56DE\u590D"; doesn't work?

No. The language doesn't know anything about Unicode. In PHP, strings are byte arrays. Therefore, how you express Unicode code points in a PHP script depends on the encoding you want to use. For UTF-8, it would be "\xE5\x9B\x9E\xE5\xA4\x8D", for UTF-16 big endian it would be "\x56\xDE\x59\x0D", and so on.
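As a minimal sketch of the point above: the needle can be written as raw UTF-8 bytes and handed to plain strpos, which compares bytes and needs neither mbstring nor PCRE's u modifier. The variable names and the sample haystack are made up for illustration, and the match only works if the haystack really is UTF-8. Note also that PHP 7.0 and later accept the braced escape "\u{56DE}" in double-quoted strings (it produces UTF-8 bytes), while the unbraced "\u56DE" form from the question is still not recognized.

```php
<?php
// Needle U+56DE U+590D written as raw UTF-8 bytes.
$needle = "\xE5\x9B\x9E\xE5\xA4\x8D";

// Hypothetical haystack, assumed to be UTF-8: "Re: " + needle + "!"
$haystack = "Re: " . $needle . "!";

// strpos compares bytes, so the result is a byte offset (or false).
$pos = strpos($haystack, $needle);

if ($pos !== false) {
    echo "found at byte offset $pos\n"; // found at byte offset 4
} else {
    echo "not found\n";
}

// PHP 7.0+ only: the braced escape yields the same UTF-8 bytes.
// On older versions the escape is left as literal text and this is false.
$sameBytes = ("\u{56DE}\u{590D}" === $needle);
```

If the haystack turns out to be in another encoding (UTF-16, GBK, ...), the needle has to be expressed in that encoding's bytes instead, or the haystack converted first.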

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

For the first part, i.e., converting U+56DE U+590D into bytes: yes, but a clarification is needed. Are these UTF-16 code units or Unicode code points? For instance, how is 𪛖 represented? As U+D869 U+DED6 or as U+2A6D6? If they are UTF-16 code units, it's trivial to encode them into UTF-16. For UTF-16 big endian, it's just "\x56\xDE\x59\x0D". Otherwise, it's still trivial to encode them in UTF-32, but it takes a little more work to do the same in UTF-16 (or UTF-8).
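The "little more work" for UTF-8 can be sketched with core functions only, so it does not depend on mbstring or iconv being available. Both helper names below (codepoint_to_utf8, surrogates_to_codepoint) are hypothetical and introduced just for this example. Input validation (lone surrogates, values above U+10FFFF) is deliberately left out.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogates_to_codepoint($high, $low)
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// The needle from the question, treated as two code points.
$needle = codepoint_to_utf8(0x56DE) . codepoint_to_utf8(0x590D);
echo bin2hex($needle), "\n";                // e59b9ee5a48d

// The surrogate pair D869 DED6 is the single code point U+2A6D6.
$cp = surrogates_to_codepoint(0xD869, 0xDED6);
printf("U+%X\n", $cp);                      // U+2A6D6
echo bin2hex(codepoint_to_utf8($cp)), "\n"; // f0aa9b96
```

For U+56DE U+590D the distinction does not matter, because both values are below U+D800 and each code point is a single UTF-16 code unit. It only matters for characters outside the Basic Multilingual Plane.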

For the second part, keep reading.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos which I am trying to avoid since it also might not be available.

What are you trying to do? Why do you need to find a position in a string? strpos will give you a byte offset for a given string (again, interpreted in binary form). Are you trying to clip a string? strpos (or even mb_strpos) means trouble with Unicode: a glyph can be made up of several code units, so you risk clipping part of a glyph. I can't advise you further unless you tell me what you're trying to do.
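To illustrate the byte-offset caveat, here is a small sketch that assumes a UTF-8 haystack. The sample string and variable names are invented for the example. It shows that the offset strpos returns counts bytes, how a code point offset could be derived from it without mbstring, and what happens when a string is cut at an arbitrary byte position.

```php
<?php
// Hypothetical UTF-8 haystack: U+56DE U+590D followed by "ok".
$haystack = "\xE5\x9B\x9E\xE5\xA4\x8D" . "ok";

// strpos reports a byte offset: "ok" starts after two 3-byte characters.
$bytePos = strpos($haystack, "ok");

// Count the code points before that offset without mbstring: every byte
// that is not a continuation byte (10xxxxxx) starts a new code point.
$charPos = 0;
for ($i = 0; $i < $bytePos; $i++) {
    if ((ord($haystack[$i]) & 0xC0) !== 0x80) {
        $charPos++;
    }
}

echo "byte offset: $bytePos, code point offset: $charPos\n";
// byte offset: 6, code point offset: 2

// Cutting at an arbitrary byte offset can split a character:
// 4 bytes keep all of U+56DE but only the first byte of U+590D.
$clipped = substr($haystack, 0, 4);
echo bin2hex($clipped), "\n"; // e59b9ee5
```

If the goal is only to test whether the needle occurs, the byte offset is harmless and `strpos(...) !== false` is enough. Even a code point offset does not protect combining sequences, so clipping text for display needs more care than this sketch shows.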
