Is PHP str_word_count() multibyte safe?

Posted on 2024-12-18 10:26:34

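I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).

But on php.net, a lot of people present their own "multibyte compatible" versions of the function.

So I guess I want to know...

Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it is not necessarily aware of the character sequences, right?
Are there any "space" characters in UTF-8 that are equivalent to, but not the same as, the ASCII " " (space)?

I guess this is where the problem would lie.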

浮萍、无处依 2024-12-25 10:26:34

I'd say you guessed right. Indeed, there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such a space:

Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

And perhaps as well:

Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
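
The byte sequences listed above can be checked directly in PHP. This is only a small sketch, assuming PHP 7.0 or later for the "\u{...}" escape syntax; the $spaces array is a name made up for this illustration:

```php
<?php
// "\u{...}" escapes (PHP 7.0+) always produce the UTF-8 encoding
// of the given code point, regardless of the source file encoding.
$spaces = [
    'NO-BREAK SPACE'      => "\u{00A0}",
    'NEXT LINE (NEL)'     => "\u{0085}",
    'LINE SEPARATOR'      => "\u{2028}",
    'PARAGRAPH SEPARATOR' => "\u{2029}",
];

foreach ($spaces as $name => $char) {
    // bin2hex() prints the raw bytes of the string as hexadecimal
    printf("%-20s %s\n", $name, bin2hex($char));
}
```

None of these byte sequences contains the ASCII space 0x20, so a byte-oriented function looking only for ASCII whitespace cannot recognise them as separators.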

Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count would be locale dependent.

To put this to the test, we can set the locale to a UTF-8 one and pass in a string that is invalid UTF-8 because it contains a lone \xA0 byte. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, and hence not multibyte safe (a term the question leaves just as undefined):

<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

$test   = "aword\xA0bword aword";
$result = str_word_count($test, 2);

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}

As this demo shows, the function totally fails on the locale promise its manual page makes, which I exploit here to demonstrate that it does nothing whatsoever regarding the UTF-8 character encoding. (I neither wonder nor moan about this: most often, if you read that a function is locale-specific in PHP, run for your life and find one that is not.)

For UTF-8, you should take a look at the PCRE extension instead:

Matching Unicode letter characters in PCRE/PHP

PCRE has a good understanding of Unicode and UTF-8, in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
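
As a rough illustration of the PCRE route, the sketch below counts runs of Unicode letters with the /u modifier. The helper name count_letter_runs() is hypothetical (it appears nowhere in this thread), and the code assumes PHP 7.0 or later and valid UTF-8 input:

```php
<?php
// Hypothetical helper: count runs of Unicode letters in a UTF-8 string.
function count_letter_runs(string $text): int
{
    // \p{L} matches any Unicode letter; /u makes PCRE treat the
    // pattern and the subject as UTF-8.
    $count = preg_match_all('/\p{L}+/u', $text);
    if ($count === false) {
        // preg_match_all() returns false on errors such as invalid UTF-8
        throw new InvalidArgumentException('PCRE error ' . preg_last_error());
    }
    return $count;
}

// A properly encoded NO-BREAK SPACE separates words like any other non-letter.
echo count_letter_runs("aword\u{00A0}bword aword"), "\n"; // prints 3
// Accented letters stay inside their word.
echo count_letter_runs("tr\u{00EA}s palavras"), "\n";     // prints 2
```

Note that this simple pattern treats apostrophes and hyphens as separators, so "five-six" counts as two runs; whether that is desirable depends on your definition of a word.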

番薯 2024-12-25 10:26:34

About the "template answer": I don't get the demand for "working faster". We're not talking about long run times or a large number of counts here, so who cares whether it takes a few milliseconds longer?

However, here is a str_word_count that works with the soft hyphen:

function my_word_count($str) {
  return str_word_count(str_replace("\xC2\xAD",'', $str));
}

And a function that complies with the asserts (but is probably not faster than str_word_count):

function my_word_count($str) {
  $mystr = str_replace("\xC2\xAD",'', $str);        // soft hyphen encoded in UTF-8
  return preg_match_all('~[\p{L}\'\-]+~u', $mystr); // regex expecting UTF-8
}

The preg function is essentially the same as what has already been proposed, except that a) it already returns a count, so there is no need to supply matches, which should make it faster, and b) there really should not be an iconv fallback, IMO.

About a comment:

I can see that your PCRE functions are worse (performance) than my
preg_word_count() because they need a str_replace that you do not need:
'~[^\p{L}\'-\xC2\xAD]+~u' works fine (!).

I considered that a different thing: the string replace only removes the multibyte character, whereas your regex deals with \xC2 and \xAD in any order they might appear, which is wrong. Consider a registered sign, which is \xC2\xAE.

However, now that I think about it, due to the way valid UTF-8 works it wouldn't really matter, so that should be equally usable. So we can just have the function

function my_word_count($str) {
  return preg_match_all('~[\p{L}\'\-\xC2\xAD]+~u', $str); // regex expecting UTF-8
}

without any need for matches or other replacements.
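
One detail may explain why the character-class version works. Inside a single-quoted PHP string, \xC2\xAD is not turned into bytes by PHP; it reaches PCRE as two escapes, and with the /u modifier PCRE reads \xC2 and \xAD as the code points U+00C2 (the letter Â, already covered by \p{L}) and U+00AD (the soft hyphen). The sketch below spells that intent out with \x{AD}; the name count_words_shy() is hypothetical, and the code assumes PHP 7.0 or later:

```php
<?php
// Hypothetical variant of my_word_count(): a "word" is a run of letters,
// apostrophes, hyphens and soft hyphens (U+00AD, written as \x{AD}).
function count_words_shy(string $str): int
{
    $count = preg_match_all('~[\p{L}\'\-\x{AD}]+~u', $str);
    if ($count === false) {
        // e.g. the subject was not valid UTF-8
        throw new InvalidArgumentException('PCRE error ' . preg_last_error());
    }
    return $count;
}

// Second assert string from the thread: "five-six" and "se<SHY>ven" are one word each.
echo count_words_shy("(one two,three;four)=4 (five-six se\xC2\xADven)=2"), "\n"; // prints 6
// A registered sign (U+00AE, bytes C2 AE) is not in the class and still separates words.
echo count_words_shy("a\u{00AE}b"), "\n"; // prints 2
```

A caveat of this class: a hyphen or apostrophe standing alone, as in "a - b", also counts as a word.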

About str_word_count(str_replace("\xC2\xAD",'', $str));: if it is stable
with UTF-8, good, but it seems it is not.

If you read this thread, you'll know str_replace is safe as long as you stick to valid UTF-8 strings. I didn't see any evidence to the contrary in your link.

多情癖 2024-12-25 10:26:34

EDITED (to show new clues): there is a possible solution using the third (character list) parameter of str_word_count(), available since PHP v5.1!

function my_word_count($str, $myLangChars="àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ") { 
    return str_word_count($str, 0, $myLangChars);
}

But it is not 100% reliable: I tried adding \xC2\xAD (SHY, the SOFT HYPHEN character) to $myLangChars, because it must be a word component in any language, and it does not work (see).

Another solution, not as fast but complete and flexible (extracted from here), is based on the PCRE library, with an option to mimic the str_word_count() behaviour on invalid UTF-8:

 /**
  * Like str_word_count() but showing how preg can do the same.
  * This function is most flexible but not faster than str_word_count.
  * @param $wRgx the "word regular expression" as defined by user.
  * @param $triggError changes behaviour causing error event.
  * @param $OnBadUtfTryAgain when true mimic the str_word_count behaviour.
  * @return 0 or positive integer as word-count, negative as PCRE error.
  */
 function preg_word_count($s,$wRgx='/[-\'\p{L}\xC2\xAD]+/u', $triggError=true,
                          $OnBadUtfTryAgain=true) {
   if ( preg_match_all($wRgx,$s,$m) !== false )
      return count($m[0]);
   else {
      $lastError = preg_last_error();
      $chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
      if ($OnBadUtfTryAgain && $chkUtf8) 
         return preg_word_count(
            iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
         );
      elseif ($triggError) trigger_error(
         $chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
         E_USER_NOTICE
         );
      return -$lastError;
   }
 }
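
The fallback branch above relies on how preg_match_all() reports invalid UTF-8 when the /u modifier is used. The sketch below isolates just that check; utf8_status() is a hypothetical helper written for this illustration (PHP 7.0 or later assumed), and it performs no conversion:

```php
<?php
// Hypothetical helper: classify a subject string the way the
// fallback branch of preg_word_count() does.
function utf8_status(string $s): string
{
    // With /u, PCRE validates the subject first; invalid UTF-8
    // makes preg_match_all() return false instead of a count.
    if (preg_match_all('/\p{L}+/u', $s) !== false) {
        return 'valid';
    }
    return preg_last_error() === PREG_BAD_UTF8_ERROR ? 'bad-utf8' : 'other-error';
}

echo utf8_status("aword\xC2\xA0bword"), "\n"; // NBSP encoded as UTF-8: prints "valid"
echo utf8_status("aword\xA0bword"), "\n";     // lone 0xA0 byte: prints "bad-utf8"
```

Converting from CP1252 in the 'bad-utf8' case, as preg_word_count() does, is a guess about the real input encoding, so it is probably best kept optional.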

(TEMPLATE ANSWER) help for bounty!

(This is not an answer but help for the bounty, because I can neither edit nor duplicate the question.)

We want to count "real-world words" in UTF-8 Latin text.

FOR BOUNTY, WE NEED:

a function that complies with the asserts below and is faster than str_word_count;
or str_word_count working with the SHY character (how?);
or preg_word_count working faster (using preg_replace? a word-separator regular expression?).

ASSERTS

Suppose that a "multibyte safe" function my_word_count() exists; then the following asserts must hold:

assert_options(ASSERT_ACTIVE, 1);

$text = "1,2,3,4=0 (1 2 3 4)=0 (... ,.)=0  (2.5±0.1; 0.5±0.2)=0";
assert( my_word_count($text)==0 ); // no word there 

$text = "(one two,three;four)=4 (five-six se\xC2\xADven)=2";
assert( my_word_count($text)==6 ); // hyphen merges two words 

$text = "(um±dois três)=3 (àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ)=1";
assert( my_word_count($text)==4 ); // a UTF8 case 

$text = "(ÍSÔ9000-X, ISÔ 9000-X, ÍSÔ-9000-X)=6"; //Codes are words?
assert( my_word_count($text)==6 ); // suppose no: X is another word
