How to avoid inadvertently encoding UTF-8 files as ASCII/ANSI?
While editing a file encoded as UTF-8 w/o [spurious] BOM, the content might end up with no Unicode characters outside the ASCII or ANSI ranges. The next time the file is opened, some text editors (Notepad++) will interpret it as ASCII/ANSI encoded and open it as such. Unaware of the change, the user continues editing and adds non-ANSI Unicode characters, which are then lost because the file is saved as ANSI. A menu option may exist (Notepad++) to open ANSI files as UTF-8 w/o BOM, but it leads to the reverse problem of inadvertently overwriting ANSI files with a Unicode encoding.
Comments (3)
One workaround is to add a character outside the ANSI range to a comment in the file. Depending on the editor's detection algorithm, this may force the editor (Notepad++) to recognize the file as encoded in UTF-8 w/o BOM.
In an HTML document, for example, you could follow the charset definition in the header with such a Unicode comment, here U+05D0 HEBREW LETTER ALEF:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <!-- א -->
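Why the comment helps can be illustrated with a small Python sketch of the kind of heuristic an editor might apply. This is not Notepad++'s actual algorithm; the function name `guess_encoding` and its return labels are hypothetical and only serve the illustration.

```python
def guess_encoding(data: bytes) -> str:
    """Hypothetical detection heuristic (not any real editor's algorithm)."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"          # explicit UTF-8 BOM
    if all(b < 0x80 for b in data):
        return "ascii"              # pure 7-bit: ASCII, ANSI and UTF-8 look identical
    try:
        data.decode("utf-8")        # non-ASCII bytes that form valid UTF-8 sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "ansi"               # assumed fallback: some 8-bit code page


meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

plain = meta.encode("utf-8")                           # no non-ASCII characters
marked = (meta + " <!-- \u05d0 -->").encode("utf-8")   # with the HEBREW LETTER ALEF comment

print(guess_encoding(plain))
print(guess_encoding(marked))
```

Under this assumed heuristic, the file without the comment is reported as plain ASCII, while the single non-ASCII character in the comment is enough to make the content detectable as UTF-8.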
How would you suggest that an editor tell the difference between ASCII/ANSI and UTF-8 w/o BOM when the files look the same?
If you want guaranteed recognition of UTF-8 as UTF-8, either add the BOM or force the file to contain non-ASCII UTF-8 characters.
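As a minimal sketch of the BOM option, Python's standard `utf-8-sig` codec writes the three-byte UTF-8 signature when encoding and strips it when decoding. The file name `example.txt` and the temporary directory are assumptions made for the illustration.

```python
import codecs
import os
import tempfile

# Hypothetical file location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) on write.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("plain ASCII text")

with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)

# Reading with "utf-8-sig" strips the BOM again, so the text is unchanged.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()

print(has_bom, repr(text))
```

Even though the text itself is pure ASCII, the file now starts with bytes that mark it as UTF-8, at the cost of the BOM that the question would rather avoid.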
Configure your editor to always use UTF-8 if possible; if not, complain to the creators of your editor. Charsets not targeting Unicode are, IMO, deprecated and should be treated as such.
Files using only characters in the ASCII range (the 7-bit one) are byte-for-byte the same in UTF-8 anyway, so if you HAVE to deliver something in ASCII encoding, just don't type any non-ASCII characters.
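The point about the 7-bit range can be checked with a short Python sketch. It assumes that "ANSI" means the Windows-1252 code page (`cp1252`), which is only one of several possibilities, and the helper `is_ascii_safe` is a hypothetical name.

```python
text = "Hello, world!"

# 7-bit ASCII text yields identical bytes in all three encodings.
same = text.encode("ascii") == text.encode("utf-8") == text.encode("cp1252")

# A single non-ASCII character makes the encodings diverge.
accented = "caf\u00e9"                    # "café"
utf8_bytes = accented.encode("utf-8")     # two bytes for the accented letter
ansi_bytes = accented.encode("cp1252")    # one byte for the accented letter


def is_ascii_safe(s: str) -> bool:
    """Hypothetical check: True if s can be delivered as plain ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(same, utf8_bytes != ansi_bytes, is_ascii_safe(text), is_ascii_safe(accented))
```

A check like `is_ascii_safe` could be run before saving when a file must stay deliverable as ASCII.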