What character set should clients assume for HTTP responses whose content type suggests character data, if none is specified?
If no charset parameter is specified in the Content-Type header, RFC 2616 section 3.7.1 seems to imply that ISO-8859-1 should be assumed for media subtypes of the "text" type:
When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.
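Read literally, the rule quoted above could be sketched as follows. This is only an illustration of the quoted text, not how any real client behaves; the function name `charset_per_rfc2616` is made up for this example, and the header parsing leans on Python's standard `email.message.Message`.

```python
from email.message import Message


def charset_per_rfc2616(content_type):
    """Return the charset implied by a Content-Type value, or None.

    Hypothetical helper: applies the RFC 2616 section 3.7.1 default
    (ISO-8859-1) only to media types whose top-level type is "text".
    """
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_param("charset")
    if explicit is not None:
        # An explicit charset parameter always wins.
        return str(explicit).lower()
    if msg.get_content_maintype() == "text":
        # No parameter, text/* type: the quoted default applies.
        return "iso-8859-1"
    # The quoted text says nothing about non-text types.
    return None
```

Under this reading, "application/x-javascript" without a charset parameter yields no default at all, which is the gap the question is asking about.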
However, I routinely see applications that serve JavaScript files with Content-Type values like "application/x-javascript" (i.e. no charset parameter), even when these scripts contain non-ASCII characters encoded as UTF-8, which would be corrupted if interpreted as ISO-8859-1.
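To make the corruption concrete, here is a small sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The sample string is an arbitrary choice for illustration.

```python
# A non-ASCII character encoded as UTF-8 occupies more than one byte.
utf8_bytes = "é".encode("utf-8")

# Decoding those bytes as ISO-8859-1 maps each byte to its own character,
# so one character turns into two unrelated ones.
misread = utf8_bytes.decode("iso-8859-1")

# Decoding with the encoding actually used recovers the original text.
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes, repr(misread), repr(correct))
```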
This does not seem to pose problems to clients. How do clients know to interpret the bytes as UTF-8? Is there a rule for other character-data subtypes that implies UTF-8 should be the default? Where is this documented?
Comments (6)
All major browsers I've checked (IE, FF and Opera) completely ignore the RFC specification in this part.
If you are interested in an algorithm for auto-detecting the charset from the data itself, see the Mozilla Firefox link below.
Just a small note about content types: only text has character sets. It's reasonable to assume that browsers handle application/x-javascript the same way they handle text/javascript (except IE6, but that's another subject).
Internet Explorer will use the default charset (probably stored in the registry), as noted:
Source: http://msdn.microsoft.com/en-us/library/ms537500%28VS.85%29.aspx
Mozilla Firefox attempts to auto-detect the charset, as pointed out here:
Source: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Opera uses auto-detection too, as documented:
Source: http://www.opera.com/docs/specs/opera9/
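As a rough idea of what detecting the charset from the data can mean, here is a deliberately naive sketch. It is not the algorithm used by Firefox, Opera, or Internet Explorer; the function name `guess_charset` and the "windows-1252" fallback are assumptions chosen for this example.

```python
import codecs


def guess_charset(data, fallback="windows-1252"):
    """Naive guess: UTF-8 if the bytes are valid UTF-8, else a fallback.

    Hypothetical helper for illustration only; real detectors use
    statistical models over many encodings.
    """
    if data.startswith(codecs.BOM_UTF8):
        # A UTF-8 byte order mark is a strong hint.
        return "utf-8"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to a configured default.
        return fallback
    return "utf-8"
```

The heuristic works reasonably well because text in legacy single-byte encodings that contains non-ASCII characters is rarely valid UTF-8 by accident.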
As described in RFC 4329, application/javascript can also have a charset parameter. How browser implementations handle it is a separate question. Sorry, but I have not tested that.
In the absence of the charset parameter, the character encoding can be specified in the content itself. Here are some approaches taken by several content types:
HTML - via the meta tag, e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML5 variant, e.g. <meta charset="utf-8">
XML (XHTML, KML) - via the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
Text - via the byte order mark. For example, for UTF-8 the first three bytes of the file are, in hexadecimal: EF BB BF
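The in-content declarations listed above could be sniffed roughly like this. It is a simplified sketch: the function name `sniff_declared_encoding`, the 1024-byte window, and the regular expressions are assumptions for illustration, and real parsers follow far more detailed rules.

```python
import codecs
import re


def sniff_declared_encoding(data):
    """Look for a UTF-8 BOM, an XML declaration, or an HTML meta charset.

    Hypothetical helper: returns the declared encoding name in lower
    case, or None when nothing is declared near the start of the data.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Declarations are ASCII-compatible, so a lossy ASCII view suffices.
    head = data[:1024].decode("ascii", errors="replace")
    xml_decl = re.search(
        r"<\?xml[^>]*encoding=[\"']([A-Za-z0-9._-]+)[\"']", head
    )
    if xml_decl:
        return xml_decl.group(1).lower()
    # Covers both <meta charset=...> and the http-equiv form, where
    # "charset=" appears inside the content attribute.
    meta = re.search(
        r"<meta[^>]*charset=[\"']?([A-Za-z0-9._-]+)", head, re.IGNORECASE
    )
    if meta:
        return meta.group(1).lower()
    return None
```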
Separately from the character set associated with the document, note that non-ASCII characters can also be represented with ASCII-only sequences, using various approaches:
HTML - via character references, e.g. &#233; or &eacute; for é
XML - via character references, e.g. &#xE9; for é
JSON - via the escaping mechanism, e.g. \u00e9 for é
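These escaping mechanisms can be tried out with Python's standard library, as a small sketch. The sample word is an arbitrary choice; the point is that the escaped forms are pure ASCII, so they survive regardless of which charset the client assumes.

```python
import html
import json

# HTML/XML numeric character references decode to the same character,
# whether written in decimal or hexadecimal.
from_decimal_ref = html.unescape("caf&#233;")
from_hex_ref = html.unescape("caf&#xE9;")

# JSON \uXXXX escapes are resolved by the JSON parser.
from_json_escape = json.loads('"caf\\u00e9"')

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so the serialized form contains only ASCII bytes.
ascii_only_json = json.dumps("café")
```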
Now, with respect to the HTTP 1.1 protocol, RFC 2616 says this about charset:
So, my interpretation of the above is that one cannot assume a default character set except for media subtypes of the type "text." Of course, we live in the real world and implementers do not always follow the rules. As described in the accepted answer, the various web browser vendors have implemented their own strategies for determining the document character set when it is not explicitly specified. One can assume that vendors of other clients (e.g., Google Earth) also implement their own strategies.
RFC 4329 defines the "application/javascript" media type as a replacement for "text/javascript", "application/x-javascript", and other similar types. Section 4.2 establishes the default character encoding to be UTF-8 when no explicit "charset" parameter is available and no Unicode BOM is present at the front of the data.
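The order of precedence described here (explicit charset parameter, then a Unicode BOM, then UTF-8) could be sketched as below. This is a simplification of RFC 4329 section 4.2, which has additional rules about how the charset parameter and a BOM interact; the function name `script_encoding` and the returned encoding labels are assumptions for illustration.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 ones because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def script_encoding(data, charset_param=None):
    """Pick an encoding for script bytes (hypothetical, simplified)."""
    if charset_param:
        # 1. An explicit charset parameter takes precedence.
        return charset_param.lower()
    for bom, name in _BOMS:
        # 2. Otherwise, a Unicode BOM at the front of the data decides.
        if data.startswith(bom):
            return name
    # 3. Otherwise, default to UTF-8.
    return "utf-8"
```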
It's a bit special for XMLHttpRequest and is described here: http://www.w3.org/TR/XMLHttpRequest/
Pointing out the obvious: "application/x-javascript" is not a subtype of "text".
Also, the text in RFC 2616 is outdated. The next revision of HTTP/1.1 will not define a default. See RFC 6657 for further information.