What common character encodings should a text editor support?

Posted 2024-08-18 17:30:35

I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.

What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
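For illustration, a minimal sketch of the detection scheme described above (BOM check plus a scan of the first 256 bytes for bytes above 0x7F) might look like this in Python; the function name and return values are hypothetical, and the BOM constants come from the standard codecs module:

```python
import codecs

def sniff_encoding(path):
    """Hypothetical helper: detect a BOM, else report ASCII for pure 7-bit samples."""
    with open(path, "rb") as f:
        head = f.read(256)
    # Check the longest BOM signatures first, so a UTF-32 LE BOM
    # (FF FE 00 00) is not mistaken for a UTF-16 LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if head.startswith(bom):
            return name
    # No BOM: if no byte in the sample exceeds 0x7F, treat it as ASCII.
    if all(b <= 0x7F for b in head):
        return "ascii"
    return None  # non-ASCII bytes and no BOM: needs further heuristics
```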

Comments (6)

昔日梦未散 2024-08-25 17:30:35

Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.

As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
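As a rough illustration of that scanning idea, a strict trial decode is usually enough in practice; this is a hypothetical helper, not anything from the answer itself:

```python
def looks_like_utf8(data: bytes) -> bool:
    # A strict trial decode: random legacy-codepage bytes > 0x7F rarely
    # form valid UTF-8 multi-byte sequences, so a clean decode is strong
    # (though not guaranteed) evidence that the file really is UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```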

醉梦枕江山 2024-08-25 17:30:35

I don't know about encodings, but make sure it supports the different line-ending standards! (\n vs \r\n)
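A rough sketch of what such support might look like, as a hypothetical Python helper:

```python
def detect_line_endings(text: str) -> str:
    # Count each style; "\r\n" must be counted first and subtracted,
    # because text.count("\r") and text.count("\n") both include it.
    crlf = text.count("\r\n")
    cr = text.count("\r") - crlf   # bare \r (classic Mac OS)
    lf = text.count("\n") - crlf   # bare \n (Unix)
    counts = {"\r\n": crlf, "\r": cr, "\n": lf}
    # If the file has no line breaks at all, fall back to "\n".
    return max(counts, key=counts.get) if any(counts.values()) else "\n"
```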

If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/

Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx

屋顶上的小猫咪 2024-08-25 17:30:35

There is no way to reliably detect an encoding. The best thing you could do is something like IE does and rely on letter distributions in different languages, as well as the standard characters of a language. But that's a long shot at best.

I would advise getting your hands on some large library of character sets (check out projects like iconv) and making all of those available to the user. But don't bother with auto-detection. Simply allow the user to select their preferred default charset, which itself would be UTF-8 by default.
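A minimal sketch of that approach, assuming a hypothetical load_text helper where the user's preference is passed in and UTF-8 is the default:

```python
def load_text(path, preferred="utf-8"):
    # Decode with the user's chosen charset (UTF-8 by default) and fail
    # loudly instead of guessing, so the UI can prompt for another choice.
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode(preferred)
    except UnicodeDecodeError as err:
        raise ValueError(
            f"File does not decode as {preferred}; let the user pick a charset"
        ) from err
```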

蝶舞 2024-08-25 17:30:35

Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russian, and Greek users too, who all have their own encodings besides UTF-8-encoded Unicode.

As for detection, most encodings are not safely detectable. In some, certain byte values are simply invalid (Windows-1252, for example, leaves a handful of bytes such as 0x81 and 0x8D undefined, and plain ASCII excludes everything above 0x7F). In UTF-8, a few byte values (0xC0, 0xC1, 0xF5-0xFF) can never occur at all, and not every sequence of the remaining values is valid. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library: try to decode and catch errors. So why not support all encodings that library supports?
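A sketch of that try-and-catch approach, with an illustrative candidate list; the ordering matters, since permissive single-byte codepages will "succeed" on almost anything:

```python
def try_decode(raw: bytes, candidates=("utf-8", "shift_jis", "gb18030", "cp1252")):
    # Try strict decodes in order. UTF-8 goes first because it is the
    # pickiest; cp1252 rejects only a handful of undefined bytes, so it
    # behaves almost like a catch-all and belongs at the end.
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 assigns all 256 byte values, so it can never fail.
    return raw.decode("latin-1"), "latin-1"
```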

You could also develop heuristics, like decoding for a specific encoding and then test the result for strange characters or character combinations or frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The most clever solution is not always the best, especially if people are used to other programs.

七婞 2024-08-25 17:30:35

UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.

1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If it is not Unicode, it is probably the current platform's default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give user option to change encoding if program's guess turns out to be wrong.
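A minimal sketch of this pipeline in Python; the BOM table, the 5% kana-lead-byte threshold, and the function name are all illustrative assumptions:

```python
import codecs
import locale

def guess_encoding(raw: bytes) -> str:
    # 1) BOM check, longest signatures first.
    for bom, name in [
        (codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if raw.startswith(bom):
            return name
    # 2) Strict UTF-8 trial decode.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 4) Cheap Shift-JIS heuristic: many 0x82/0x83 lead bytes suggest kana.
    if raw and sum(b in (0x82, 0x83) for b in raw) > len(raw) // 20:
        try:
            raw.decode("shift_jis")
            return "shift_jis"
        except UnicodeDecodeError:
            pass
    # 3) Fall back to the platform's default codepage; step 5 (letting the
    # user override the guess) belongs in the UI, not here.
    return locale.getpreferredencoding(False)
```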

蝶…霜飞 2024-08-25 17:30:35

Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.

Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.
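A sketch of that "alternating 0 bytes" test for BOM-less UTF-16; the 70%/10% thresholds are arbitrary illustrative choices:

```python
def sniff_utf16(sample: bytes):
    # In BOM-less UTF-16 that is mostly ASCII, every other byte is 0x00;
    # which side the zeros fall on reveals the byte order.
    if len(sample) < 4:
        return None
    even_zeros = sum(sample[i] == 0 for i in range(0, len(sample), 2))
    odd_zeros = sum(sample[i] == 0 for i in range(1, len(sample), 2))
    half = len(sample) // 2
    if even_zeros > 0.7 * half and odd_zeros < 0.1 * half:
        return "utf-16-be"  # zeros in the high (first) byte of each pair
    if odd_zeros > 0.7 * half and even_zeros < 0.1 * half:
        return "utf-16-le"  # zeros in the low (second) byte of each pair
    return None
```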
