Charsets and the "Preferred MIME Name"
The HTTP Accept-Charset header contains atoms naming the acceptable character sets, and the charset= parameter in a MIME Content-Type header contains an atom naming the character set of the data that follows.
My question is the following: must these atoms match the preferred MIME encoding name or charset name, or can they match any alias of a charset?
I mean "alias" and "preferred MIME name" as they are used at http://www.iana.org/assignments/character-sets.
I'm planning on using iconv to convert to the platform-native wide UTF encoding, and I don't want to store one record per charset in the form (iconv_alias, { list-of-aliases }). I'd rather use a simple (alias, iconv_alias) 2-tuple.
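For illustration, here is a minimal sketch of the flat (alias, iconv_alias) table described above, written in Python rather than against iconv itself. The table name ALIAS_TO_ICONV and the function iconv_name are hypothetical, the aliases are a small hand-picked subset of the IANA registry, and the right-hand names are assumed to be accepted by the local iconv (check with `iconv -l`).

```python
# Hypothetical flat (alias, iconv_alias) table: one entry per alias,
# instead of one record per charset holding a nested list of aliases.
# The iconv names on the right are assumptions; verify with `iconv -l`.
ALIAS_TO_ICONV = {
    "us-ascii": "ASCII",
    "ansi_x3.4-1968": "ASCII",
    "csascii": "ASCII",
    "iso-8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "l1": "ISO-8859-1",
    "utf-8": "UTF-8",
    "shift_jis": "SHIFT_JIS",
    "ms_kanji": "SHIFT_JIS",
    "euc-jp": "EUC-JP",
}


def iconv_name(label):
    """Return the iconv name for a charset label, or None if it is unknown."""
    # Charset names are case-insensitive, so the lookup key is lowercased.
    # Surrounding whitespace and double quotes are stripped first.
    key = label.strip().strip('"').lower()
    return ALIAS_TO_ICONV.get(key)
```

With this layout, accepting aliases costs one extra row per alias, so whether to accept only preferred names or every alias becomes a question of which rows the table contains.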
Comments (1)
Yes, these atoms must IMO match the preferred MIME charset name, and not just any alias.
Explanations.
I was surprised that a Google search for "List of MIME charsets" returned nothing clearer or more useful than the IANA "Character Sets" page at http://www.iana.org/assignments/character-sets, which you mentioned in 2011 and which still seems to be the best source. Pressing Ctrl+F and searching for "MIME" on that page returns just ONE hit: "Preferred MIME Name", the heading of the very column you and I are looking for. That column lists the charset names that are ACTUALLY used, with very few exceptions or errors, in the hundreds of billions of email messages sent EVERY DAY WORLDWIDE. This high level of compliance, and hence of reliability, at such a scale is due IMO to the email system and its current standard, MIME, having received careful attention and intelligent implementation from the beginning, resulting in two fundamental traits:
The charset is stated ONCE AND ONLY ONCE in each document, namely in the MIME header of each email message, a header most people never notice because it causes no problems.
The list of allowed charset names is visibly SHORT, SOLID and INVARIABLE, and nobody strays from it.
Here are a few:
us-ascii, iso-8859-1 (originally the standard on the net), utf-8 (now the current standard), Shift_JIS, EUC-JP, ISO-2022-JP, ...
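As a rough sketch of the two traits above (one declaration per message, and a short list of names), the following Python reads the charset parameter from a Content-Type value and checks it against a small set. The set PREFERRED_NAMES and both function names are hypothetical, and the set is only an illustrative subset of the "Preferred MIME Name" column, not the full registry. The parsing relies on the standard library's email.message.Message.

```python
from email.message import Message

# Illustrative subset of IANA "Preferred MIME Name" values, lowercased.
# This is an assumption for the example, not the complete registry.
PREFERRED_NAMES = {
    "us-ascii",
    "iso-8859-1",
    "utf-8",
    "shift_jis",
    "euc-jp",
    "iso-2022-jp",
}


def declared_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() unquotes and lowercases the parameter value.
    return msg.get_content_charset()


def uses_preferred_name(content_type):
    """Return True if the declared charset is in the short list above."""
    return declared_charset(content_type) in PREFERRED_NAMES
```

For example, a value such as text/plain; charset=latin1 declares a registered alias rather than a preferred name, so uses_preferred_name rejects it under this strict reading.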
WEB PAGES are NOT handled as reliably where charsets are concerned, for two main reasons IMO:
Hence my suggestions:
Note. There is a big problem, due to UTF-8 having been imposed from the 1990s onward for political reasons, against many obvious technical and human reasons. More on this later. Meanwhile, UTF-8 brings nothing good or bad for English, little harm for European languages, and many complications for other languages such as Arabic and the Asian ones, and sometimes even for others such as Cyrillic.
Versailles, Thu 27 Apr 2023 18:12:10 +0200, edited (typo) 18h21m50