___ Encoding to UTF-8 - is there a definitive solution?

Posted on 2024-09-05 07:30:41

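I've searched the web, searched SO, searched the PHP docs, and more.

It seems ridiculous that there is no standard solution for this. If you receive text in an unknown character set and it contains odd characters (curly English quotes, for example), is there a standard way to convert it to UTF-8?

I've seen many messy solutions that use a pile of functions and checks, but none of them work reliably.

Has anyone come up with their own function, or a solution that always works?

Edit

Many people have answered along the lines of "this is unsolvable". I understand that now, but apart from utf8_encode, which is very limited, nobody has offered a workable approach. What methods are there for dealing with this? What is the best approach?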


暖心男生 2024-09-12 07:30:41

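No. One should always know which character set a string is in. Guessing the character set with a sniffing function is unreliable (although in most cases, in the Western world, the confusion is between ISO-8859-1 and UTF-8).

But why do you have to deal with unknown character sets at all? There is no general solution because the general problem should not exist in the first place. Every web page and data source can and should declare its character set, and if one doesn't, the administrator of that resource should be asked to add a declaration.

(I don't mean to sound like a smartass, but this is the only way to get this problem solved.)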
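
A minimal sketch of relying on the declared charset instead of guessing, assuming the data arrives with an HTTP-style Content-Type header and that the mbstring extension is loaded. The function names `declaredCharset` and `toUtf8FromDeclared` are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical helper: extract the declared charset from a Content-Type value.
function declaredCharset(string $contentType): ?string {
    // Accepts charset=utf-8 and charset="utf-8", with optional spaces around "=".
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null; // nothing declared: ask the maintainer rather than guess
}

// Hypothetical helper: convert to UTF-8 using only the declared charset.
function toUtf8FromDeclared(string $bytes, string $contentType): ?string {
    $charset = declaredCharset($contentType);
    if ($charset === null) {
        return null;
    }
    try {
        // PHP 8 throws ValueError for an encoding name mbstring does not know;
        // PHP 7 emits a warning and returns false instead.
        $result = mb_convert_encoding($bytes, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        return null;
    }
    return $result === false ? null : $result;
}
```

Returning `null` when no charset is declared keeps the "do not guess" rule explicit: the caller has to decide what to do with undeclared data.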


春夜浅 2024-09-12 07:30:41

The reason you saw so many complicated solutions for this problem is that, by definition, it is not solvable. The process of encoding a string of text is non-deterministic.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, strictly logically speaking, it is not possible to determine the encoding, the character set, and the text from a byte stream alone.

In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.

I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with the encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on the encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server, because the browser may already have modified the byte stream based on the assumed encoding.

For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, 2 bytes (xC3 x84) are sent to the server. This is a valid EBCDIC string that represents the letters C and d. It is also a valid ANSI string that represents the 2 characters Ã and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8, because the operating system knows that I am pasting ANSI (I copied the text from Textpad, where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
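
The byte values in this example can be reproduced with a short sketch, assuming the mbstring extension is available and taking "ANSI" to mean Windows-1252 (an assumption; the answer does not name the code page). EBCDIC is left out because this sketch does not rely on mbstring supporting it.

```php
<?php
// The two bytes a UTF-8 page sends for the letter Ä.
$bytes = "\xC3\x84";

// Both checks are expected to pass, which is why validity alone cannot
// tell the two encodings apart.
$validAsUtf8 = mb_check_encoding($bytes, 'UTF-8');
$validAsAnsi = mb_check_encoding($bytes, 'Windows-1252');

// Treat the same bytes as Windows-1252 text and transcode them to UTF-8,
// as the operating system would when pasting ANSI text into the form.
$transcoded = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

echo bin2hex($transcoded), PHP_EOL; // c383e2809e, i.e. xC3 x83 xE2 x80 x9E
```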

My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because character encoding is non-deterministic, you cannot expect a trivial method to exist for recovering from such a situation.

Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
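
The ask-the-user approach could be prepared roughly as follows. This is only a sketch, assuming mbstring is loaded; `previewDecodings` is a hypothetical name, and the candidate list is an arbitrary choice for illustration.

```php
<?php
// Hypothetical helper: decode the same bytes under several candidate
// encodings so that a human can pick the rendering that looks right.
function previewDecodings(string $bytes, array $candidates, int $maxChars = 200): array {
    $previews = [];
    foreach ($candidates as $encoding) {
        // Skip candidates under which the bytes are not even valid.
        if (!mb_check_encoding($bytes, $encoding)) {
            continue;
        }
        $utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        // Keep only the first $maxChars characters as the preview.
        $previews[$encoding] = mb_substr($utf8, 0, $maxChars, 'UTF-8');
    }
    return $previews;
}

// Each preview would be shown to the user until one is confirmed.
$previews = previewDecodings("\xC3\x84", ['UTF-8', 'Windows-1252', 'ISO-8859-1']);
foreach ($previews as $encoding => $text) {
    echo $encoding, ': ', $text, PHP_EOL;
}
```

Invalid candidates are dropped up front so the user is never asked about a decoding that cannot be right.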

Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded a text file that contains the two bytes xC3 x84. There is no other information, just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but that the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
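
A toy version of such a heuristic might look like the sketch below. It assumes PHP 7.4 or later (for `mb_str_split`), a source file saved as UTF-8, and hand-picked character lists; `scoreDecoding`, `guessEncoding`, and both lists are hypothetical and far too small for real use.

```php
<?php
// Hypothetical scoring: +1 for each character that is common in the expected
// language, -1 for each character that rarely occurs in running text.
function scoreDecoding(string $utf8, string $commonChars, string $rareChars): int {
    $score = 0;
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        if (mb_strpos($commonChars, $char, 0, 'UTF-8') !== false) {
            $score++;
        } elseif (mb_strpos($rareChars, $char, 0, 'UTF-8') !== false) {
            $score--;
        }
    }
    return $score;
}

// Hypothetical helper: return the candidate whose decoding scores highest,
// or null if the bytes are invalid under every candidate.
function guessEncoding(string $bytes, array $candidates, string $commonChars, string $rareChars): ?string {
    $best = null;
    $bestScore = PHP_INT_MIN;
    foreach ($candidates as $encoding) {
        if (!mb_check_encoding($bytes, $encoding)) {
            continue;
        }
        $decoded = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        $score = scoreDecoding($decoded, $commonChars, $rareChars);
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}

// German umlauts count as common; typical mis-decoding debris counts as rare.
$guess = guessEncoding("\xC3\x84", ['UTF-8', 'Windows-1252'], 'äöüÄÖÜß', 'Ã„Â');
echo $guess, PHP_EOL;
```

With these lists the UTF-8 reading (Ä) scores 1 and the Windows-1252 reading (Ã„) scores -2, so the sketch picks UTF-8 for the two-byte file; a realistic detector would need frequency tables per language rather than two short strings.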


岁月如刀 2024-09-12 07:30:41

Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:

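function forceToUtf8($string) {
    // Already valid UTF-8: nothing to convert.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    // Guess the source encoding in strict mode; this is a heuristic and may be wrong.
    $encoding = mb_detect_encoding($string, mb_detect_order(), true);
    if ($encoding === false) {
        return false; // no encoding in the detection order matched
    }
    return mb_convert_encoding($string, 'UTF-8', $encoding);
}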
