Character sets and the "preferred MIME name"

Posted 2024-12-18 11:50:27

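The HTTP Accept-Charset header contains atoms naming the acceptable character sets, and the charset= field of the MIME Content-Type header contains an atom naming the character set of the data that follows.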

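My question: must these atoms match the preferred MIME name (or the charset name), or may they match any alias of the character set?

http://www.iana.org/assignments/character-sets lists the aliases and preferred MIME names in use.

I plan to use iconv to convert to the platform's native wide UTF, and I would rather not enter one entry per character set as an array of fields of the form (iconv_alias, { list-of-aliases }). I would prefer a simple (alias, iconv_alias) 2-tuple.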
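
A minimal sketch of the flat 2-tuple table described above, written in Python rather than C for brevity. The names `ALIAS_TABLE` and `to_iconv_name`, the table entries, and the iconv names on the right-hand side are illustrative assumptions; a real table would be generated from the IANA registry and checked against the names the local iconv supports. Lookup is case-insensitive because MIME charset values are not case-sensitive.

```python
# Hypothetical flat table of (alias, iconv_name) 2-tuples.
# The entries are illustrative, not a complete or authoritative list.
ALIAS_TABLE = [
    ("us-ascii", "US-ASCII"),
    ("ascii", "US-ASCII"),
    ("iso-8859-1", "ISO-8859-1"),
    ("latin1", "ISO-8859-1"),
    ("l1", "ISO-8859-1"),
    ("utf-8", "UTF-8"),
    ("shift_jis", "SHIFT_JIS"),
    ("ms_kanji", "SHIFT_JIS"),
]

# Build the lookup once; keys are lowercased for case-insensitive matching.
_LOOKUP = {alias.lower(): iconv_name for alias, iconv_name in ALIAS_TABLE}


def to_iconv_name(charset_atom):
    """Return the iconv name for a charset atom, or None if it is unknown."""
    return _LOOKUP.get(charset_atom.strip().lower())


print(to_iconv_name("Latin1"))
print(to_iconv_name("x-unknown"))
```

With one row per alias, accepting a new alias means adding a single line, and the preferred name is just another row that maps to itself.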


忆梦 2024-12-25 11:50:27

Yes, in my opinion these atoms must match the preferred MIME charset name, and not just any alias.
Explanation.
I was surprised that a Google search for "List of MIME charsets" returned nothing clearer or more useful than the IANA "Character Sets" page at http://www.iana.org/assignments/character-sets that you mentioned in 2011, which still seems to be the best source. Searching that page for "MIME" with Ctrl+F returns just ONE hit: "Preferred MIME Name", the title of the very column you and I are looking for. It lists the charset names that are ACTUALLY used, with a very small number of exceptions or errors, in what I estimate to be 350 trillion (i.e. millions of millions) email messages sent EVERY DAY WORLDWIDE. IMO this high level of compliance, and hence of reliability, at such a giant scale is due to the email system and its current standard, MIME, having received careful attention and an intelligent implementation from the beginning, resulting in 2 fundamental traits:

The charset is stated ONCE AND ONLY ONCE in each document, namely in the MIME header of each email message, which most people never notice because it doesn't cause problems.
The list of allowed charset names is visibly SHORT, SOLID and INVARIABLE, and nobody strays from it.
Here are a few:
us-ascii, iso-8859-1 (initially the standard of the net), utf-8 (now the current standard), Shift_JIS, EUC-JP, ISO-2022-JP, ...
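
To illustrate the two traits above, here is a small sketch that reads the single charset declaration from a Content-Type value and checks it against a short list of names. It uses Python's standard `email.message.Message.get_content_charset()`, which returns the charset lowercased. The set `PREFERRED` and the helper names are assumptions made for the example; the set is only a subset of the IANA "Preferred MIME Name" column.

```python
from email.message import Message

# Illustrative subset of preferred MIME names, lowercased for comparison.
PREFERRED = {"us-ascii", "iso-8859-1", "utf-8", "shift_jis", "euc-jp", "iso-2022-jp"}


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def uses_preferred_name(content_type):
    """True if the declared charset is in the illustrative PREFERRED set."""
    return declared_charset(content_type) in PREFERRED


print(declared_charset('text/plain; charset="ISO-8859-1"'))
print(uses_preferred_name("text/plain; charset=latin1"))
```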

WEB PAGES are NOT treated as reliably where charsets are concerned, for 2 main reasons IMO:

By my estimate, only 12 billion web pages are created every day (3,000x fewer than email messages), so the attention given to the matter is newer (web pages have only existed since 1993, 22 years after email in 1971) and, relatively, much smaller.
In each web page, many settings are stated in TWO different places and contexts: in the HTML headers (MULTIPLE, but corresponding to the SINGLE MIME header of an email) BUT also MULTIPLE TIMES INSIDE the document. In a crowd of millions or maybe billions of writers, this inevitably brings a lot of misunderstanding, divergence, confusion, hesitation, incompatibility, and even errors.
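
The second reason can be made concrete with a rough sketch that collects every charset declared for one page and reports a disagreement. It assumes the two places are the HTTP Content-Type header and `<meta>` tags in the document, which is one reading of the point above. The names `MetaCharsetFinder`, `charset_of` and `conflicting_charsets` are hypothetical, and the parsing relies only on Python's standard `html.parser` and `email.message` modules.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Collect every charset declared by <meta> tags in a document."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # <meta charset="...">
            self.found.append(attr["charset"].lower())
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="...; charset=...">
            cs = charset_of(attr.get("content") or "")
            if cs:
                self.found.append(cs)


def conflicting_charsets(http_content_type, html_text):
    """Return the sorted distinct charsets if the declarations disagree, else []."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    declared = set(finder.found)
    header_cs = charset_of(http_content_type)
    if header_cs:
        declared.add(header_cs)
    return sorted(declared) if len(declared) > 1 else []
```

For example, a page served as `text/html; charset=UTF-8` whose markup contains `<meta charset="iso-8859-1">` would be reported as conflicting, while an email message has only one place where such a declaration can live.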

Hence my proposals:

YES, we should all use ONLY the IANA "Preferred MIME Names", which are already more than sufficient.
The most important standardization bodies, AFAIK IANA and the IETF in this matter, could compile a SINGLE document and list, carefully designed, starting from the current IANA list above but much clearer and shorter, declare it a standard, and publicize it widely so that everyone knows it.

Note. There is a big problem, due to UTF-8 having been imposed in the 1980s-90s for political reasons, against many obvious technical and human reasons. More on this later. Meanwhile, UTF-8 brings nothing good or bad for English, does little harm for European languages, and brings many complications for other languages such as Arabic and Asian ones, and sometimes even for others such as those written in Cyrillic.
Versailles, Thu 27 Apr 2023 18:12:10 +0200, edited (typo) 18h21m50
