Strange behaviour of mb_detect_order() in PHP

Posted on 2024-09-02 13:24:41

I would like to detect encoding of some text (using PHP).
For that purpose I use the mb_detect_encoding() function.

The problem is that the function returns different results if I change the order of candidate encodings with the mb_detect_order() function.

Consider the following example:

$html = <<< STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html); // detect on $html, the variable defined above
die($originalEncoding); // $originalEncoding = 'UTF-8'

However, if you change the order of the encodings in mb_detect_order(), the result is different:

mb_detect_order(array('EUC-JP','UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html); // run the detection again with the new order
die($originalEncoding); // $originalEncoding = 'EUC-JP'

So my questions are:
Why is that happening?
Is there a way in PHP to correctly and unambiguously detect the encoding of a text?

Comments (4)

仙女山的月亮 2024-09-09 13:24:41

That's what I would expect to happen.

The detection algorithm probably just keeps trying, in order, the encodings you specified in mb_detect_order and then returns the first one under which the bytestream would be valid.
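The "first valid encoding wins" idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not mbstring's actual implementation: firstValidEncoding() is a hypothetical helper, and it uses mb_check_encoding() as the validity test.

```php
<?php
// Hypothetical helper (not part of mbstring): return the first candidate
// encoding under which the byte string is valid, or null if none fits.
function firstValidEncoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// The bytes of "héllo" encoded as UTF-8. They are also a valid (but
// different) ISO-8859-1 string, so the answer depends only on the order.
$bytes = "h\xC3\xA9llo";

echo firstValidEncoding($bytes, ['UTF-8', 'ISO-8859-1']), "\n";
echo firstValidEncoding($bytes, ['ISO-8859-1', 'UTF-8']), "\n";
```

The same bytes produce a different answer for each ordering, which matches the effect reported in the question.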

Something more intelligent requires statistical methods (I think machine learning is commonly used).

EDIT: See e.g. this article for more intelligent methods.

Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).

两相知 2024-09-09 13:24:41

Not really. Different encodings often have large areas of overlap, and if the string you are testing lies entirely inside that overlap, then both encodings are acceptable.

For example, utf-8 and ISO-8859-1 are the same for the letters a-z. The string "hello" would have an identical sequence of bytes in both encodings.

This is exactly why there is an mb_detect_order() function in the first place, as it allows you to say what you would prefer to happen when these clashes happen. Would you like "hello" to be utf-8 or ISO-8859-1?
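To make the overlap concrete, here is a small sketch that compares the raw bytes of a string in both encodings. The variable names are illustrative only.

```php
<?php
// "hello" contains only ASCII letters, so converting it between UTF-8 and
// ISO-8859-1 leaves the bytes untouched.
$utf8   = 'hello';
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

var_dump($utf8 === $latin1);   // the two byte strings are identical
echo bin2hex($utf8), "\n";     // hex dump of the shared bytes

// Outside the ASCII range the encodings diverge: "é" (U+00E9) takes
// two bytes in UTF-8 but a single byte in ISO-8859-1.
$eUtf8   = "\u{E9}";
$eLatin1 = mb_convert_encoding($eUtf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($eUtf8), "\n";
echo bin2hex($eLatin1), "\n";
```

For "hello" no byte-level test can tell the two encodings apart, so the preference order is the only thing left to decide the result.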

能否归途做我良人 2024-09-09 13:24:41

Keep in mind that mb_detect_encoding() does not know what encoding the data is in. You may see a string, but the function itself only sees a stream of bytes. Going by that, it needs to guess what the encoding is - e.g. the data would be ASCII if all bytes are in the 0-127 range, UTF-8 if there are ASCII bytes plus bytes of 128 and above that occur only in groups of two or more, and so forth.
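The two byte-level tests mentioned above can be sketched roughly as follows. classifyBytes() is a hypothetical helper written for illustration; it is a simplification and does not claim to mirror what mbstring does internally.

```php
<?php
// Hypothetical helper: a crude byte-level guess that distinguishes only
// ASCII, valid UTF-8, and everything else.
function classifyBytes(string $bytes): string
{
    // Every byte in the 0-127 range: plain ASCII.
    if (preg_match('/\A[\x00-\x7F]*\z/', $bytes) === 1) {
        return 'ASCII';
    }
    // High bytes appear only in well-formed multi-byte sequences.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    return 'unknown';
}

echo classifyBytes('hello'), "\n";         // only bytes below 128
echo classifyBytes("\xE3\x81\xA1"), "\n";  // UTF-8 bytes of the hiragana "ち"
echo classifyBytes("\xE9"), "\n";          // a lone high byte
```

Even this toy version shows the limitation: a result of 'UTF-8' only means the bytes are consistent with UTF-8, not that the author actually used it.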

As you can imagine, given that context, it's quite difficult to detect an encoding reliably.

Like rihk said, this is what the mb_detect_order() function is for - you're basically supplying your best guess as to what the data is likely to be. Do you work with UTF-8 files frequently? Then chances are your data isn't UTF-16, even if mb_detect_encoding() could guess it as that.

You might also want to check out Artefacto's link for a more in-depth view.

Example case: Internet Explorer uses some interesting encoding guessing if nothing is specified (@link, Section: 'To automatically detect a website's language'), which has caused strange behaviour on websites that took encoding for granted in the past. You can probably find some amusing stuff on that if you google around. It makes for a nice showcase of how even statistical methods can backfire horribly, and why encoding-guessing in general is problematic.

神也荒唐 2024-09-09 13:24:41

mb_detect_encoding looks at the first charset entry in your mb_detect_order() and then loops through your input $html, checking character by character whether each one falls within the valid set of characters for that charset. If every character matches, it returns that charset; if any character fails, it moves on to the next charset in the mb_detect_order() list and tries again.

The Wikipedia list of charsets is a good place to see the characters that make up each charset.

Because these charset values overlap (char x8fA1EF exists in both 'UTF-8' and in 'EUC-JP') this will be considered a match even though it's a totally different character in each character set. So unless any of the character values exist in one charset, but not in another, then mb_detect_encoding can't identify which of the charsets is invalid; and will return the first charset from your array list which could be valid.

As far as I'm aware, there is no surefire way of identifying a charset. PHP's "best guess" method can be helped if you have a reasonable idea of what charsets you are likely to encounter, and order your list accordingly based on the gaps (invalid characters) in each charset.
The best solution is to "know" the charset. If you are scraping your html from another page, look for the charset identifier in the header of that page.
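As a rough sketch of "look for the charset identifier", the helper below pulls a declared charset out of a Content-Type header value or, failing that, out of a meta tag in the markup. declaredCharset() and its regular expressions are assumptions made for illustration; a real page may need a proper HTML parser.

```php
<?php
// Hypothetical helper: return the charset a page declares about itself,
// upper-cased, or null if no declaration is found.
function declaredCharset(?string $contentType, string $html): ?string
{
    $name = '["\']?([A-Za-z0-9._:-]+)';

    // 1. HTTP header value, e.g. "text/html; charset=EUC-JP"
    if ($contentType !== null
        && preg_match('/charset\s*=\s*' . $name . '/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. <meta charset="..."> or <meta http-equiv=... content="...; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*' . $name . '/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    return null;
}

echo declaredCharset('text/html; charset=EUC-JP', ''), "\n";
echo declaredCharset(null, '<meta charset="utf-8">'), "\n";
```

A declared charset can itself be wrong, so it may be worth confirming it with mb_check_encoding() before converting, and falling back to detection only when no declaration exists.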

If you really want to be clever, you can try and identify the language in which the html is written, perhaps using trigrams or n-grams or similar as described in this article on PHP/ir.
