For HTTP responses whose Content-Type suggests character data, which charset should the client assume if none is specified?

Posted on 2024-08-22 23:05:11

If no charset parameter is specified in the Content-Type header, RFC 2616 section 3.7.1 seems to imply that ISO-8859-1 should be assumed for media subtypes of the "text" type:

When no explicit charset parameter is
provided by the sender, media subtypes
of the "text" type are defined to have
a default charset value of
"ISO-8859-1" when received via HTTP.

Data in character sets other than
"ISO-8859-1" or its subsets MUST be
labeled with an appropriate charset
value.

However, I routinely see applications that serve JavaScript files with Content-Type values like "application/x-javascript" (i.e., no charset parameter), even when these scripts contain non-ASCII characters encoded as UTF-8, which would be corrupted if interpreted as ISO-8859-1.
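To make the corruption concrete, here is a minimal Python sketch. The script body is a made-up example, not taken from any real application; it only shows what happens when the same UTF-8 bytes are decoded with each charset.

```python
# Illustrative only: a made-up script body containing one non-ASCII character.
script = 'var greeting = "héllo";'
utf8_bytes = script.encode("utf-8")  # "é" becomes the two bytes C3 A9

# Decode the same bytes under both assumptions.
as_utf8 = utf8_bytes.decode("utf-8")         # round-trips cleanly
as_latin1 = utf8_bytes.decode("iso-8859-1")  # each byte becomes one character

print(as_utf8)
print(as_latin1)
```

Under ISO-8859-1 every byte maps to exactly one character, so decoding never fails; the text is silently altered instead.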

This does not seem to pose problems to clients. How do clients know to interpret the bytes as UTF-8? Is there a rule for other character-data subtypes that implies UTF-8 should be the default? Where is this documented?

Comments (6)

等待我真够勒 2024-08-29 23:05:11

All major browsers I've checked (IE, FF and Opera) completely ignore the RFC specification in this part.

If you are interested in the algorithm for auto-detecting the charset from the data, see the Mozilla Firefox link below.

Just a small note about content types: only text has character sets. It's reasonable to assume that browsers handle application/x-javascript the same way they handle text/javascript (except IE6, but that's another subject).

Internet Explorer will use the default charset (probably stored in the registry), as noted:

By default, Internet Explorer uses the
character set specified in the HTTP
content type returned by the server to
determine this translation. If this
parameter is not given, Internet
Explorer uses the character set
specified by the meta element in the
document. It uses the user's
preferences if no meta element is
specified.

Source: http://msdn.microsoft.com/en-us/library/ms537500%28VS.85%29.aspx

Mozilla Firefox attempts to auto-detect the charset, as described here:

This paper presents three types of auto-detection methods to determine encodings of documents without explicit charset declaration.

Source: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Opera uses auto-detection too, as documented:

If the transport protocol provides an encoding name, that is used. If not, Opera will look at the page for a charset declaration. If this is missing, Opera will attempt to auto-detect the encoding, using the domain name to see if the script is a CJK script, and if so which one. Opera can also auto-detect UTF-8.

Source: http://www.opera.com/docs/specs/opera9/
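The precedence that the quoted IE and Opera documentation describes (transport-level charset first, then an in-document declaration, then some fallback) can be sketched roughly as follows. This is a simplified illustration, not any browser's actual algorithm: `pick_encoding`, the regular expressions, the 1024-byte scan window, and the `windows-1252` fallback are all assumptions made for the example.

```python
import re

# Hypothetical patterns for this sketch; real browsers use a proper parser.
_CHARSET_PARAM = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def pick_encoding(content_type, body, fallback="windows-1252"):
    """Return (encoding, source) using a simplified precedence order."""
    # 1. A charset parameter on the HTTP Content-Type header wins.
    if content_type:
        match = _CHARSET_PARAM.search(content_type)
        if match:
            return match.group(1).lower(), "http"
    # 2. Otherwise look for a <meta> declaration near the start of the body.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    # 3. A real browser would run heuristic detection or consult user
    #    preferences here; this sketch just returns an assumed fallback.
    return fallback, "fallback"
```

For example, `pick_encoding("text/html", b'<meta charset="utf-8">')` yields the encoding from the meta element, because the header carries no charset parameter.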

绝不服输 2024-08-29 23:05:11

As described in RFC 4329, application/javascript can also have a charset parameter. How browser implementations actually handle it is a separate question; sorry, I have not tested that.

浅笑依然 2024-08-29 23:05:11

In the absence of the charset parameter, the character encoding can be specified in the content. Here are some approaches taken by several content types:

HTML - Via the meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 variant:

<meta charset="utf-8">

XML (XHTML, KML) - Via the XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

Text - Via the Byte order mark. For example, for UTF-8 the first three bytes of a file in hexadecimal:

EF BB BF
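As a rough illustration of the byte order mark approach, the following Python sketch checks the start of a byte string against the BOM constants in the standard `codecs` module. The function name `sniff_bom` and the returned encoding labels are choices made for this example.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would typically skip `bom_length` bytes and decode the remainder with the returned encoding.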

As distinct from the character set associated with the document, note also that non-ASCII characters can be encoded via ASCII character sequences using various approaches:

HTML - Via character references:

&#nnnn;
&#xhhhh;

XML - Via character references:

&amp;
&defined-entity;

JSON - Via the escaping mechanism:

\u005C
\uD834\uDD1E
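The two JSON escapes above can be checked with Python's standard `json` module. This is only a small demonstration of the escaping mechanism; the variable names are arbitrary.

```python
import json

# \u005C is a backslash; \uD834\uDD1E is a surrogate pair that encodes
# the single code point U+1D11E (musical symbol G clef).
text = json.loads(r'"\u005C \uD834\uDD1E"')

# Going the other way: json.dumps escapes non-ASCII characters by default
# (ensure_ascii=True), so the output is pure ASCII.
escaped = json.dumps("é")

print(repr(text))
print(escaped)
```

Because the escaped form is pure ASCII, it survives regardless of which charset the client assumes for the response.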

Now, with respect to the HTTP 1.1 protocol, RFC 2616 says this about charset:

The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text" type
are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value. See
section 3.4.1 for compatibility problems.

So, my interpretation of the above is that one cannot assume a default character set except for media subtypes of the type "text." Of course, we live in the real world and implementers do not always follow the rules. As described in the accepted answer, the various web browser vendors have implemented their own strategies for determining the document character set when it is not explicitly specified. One can assume that vendors of other clients (e.g., Google Earth) also implement their own strategies.
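The rule as read above can be sketched in Python using the standard library's `email.message.Message` to parse the header value. The function name `rfc2616_charset` is hypothetical, and the sketch reflects only the quoted RFC 2616 text, not what browsers actually do.

```python
from email.message import Message


def rfc2616_charset(content_type):
    """Return the charset implied by the quoted RFC 2616 rule, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # An explicit charset parameter always wins (returned lowercased).
    explicit = msg.get_content_charset()
    if explicit:
        return explicit
    # The default applies only to the "text" top-level type.
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    # For any other type, the quoted text defines no default.
    return None
```

With this reading, `rfc2616_charset("application/x-javascript")` returns `None`, which matches the point that no default can be assumed outside the "text" type.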

浪推晚风 2024-08-29 23:05:11

RFC 4329 defines the "application/javascript" media type as a replacement for "text/javascript", "application/x-javascript", and other similar types. Section 4.2 establishes the default character encoding to be UTF-8 when no explicit "charset" parameter is available and no Unicode BOM is present at the front of the data.
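A minimal Python sketch of the default described here might look like the following. The function name `js_default_encoding` is hypothetical, and the relative precedence of the charset parameter over a BOM is an assumption of this sketch rather than something taken from the RFC text; only the final UTF-8 default comes from the description above.

```python
import codecs


def js_default_encoding(charset_param, body):
    """Pick an encoding for script bytes, defaulting to UTF-8."""
    # Assumption of this sketch: an explicit charset parameter is used as-is.
    if charset_param:
        return charset_param.lower()
    # No charset parameter: look for a Unicode BOM at the front of the data.
    for bom, name in (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ):
        if body.startswith(bom):
            return name
    # Neither a charset parameter nor a BOM: default to UTF-8.
    return "utf-8"
```

This would explain the behaviour in the question: a script served as "application/x-javascript" with no charset and no BOM ends up decoded as UTF-8.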

只涨不跌 2024-08-29 23:05:11

XMLHttpRequest is a bit of a special case; its handling is described here: http://www.w3.org/TR/XMLHttpRequest/

肩上的翅膀 2024-08-29 23:05:11

Pointing out the obvious: "application/x-javascript" is not a subtype of "text".

Also, the text in RFC 2616 is outdated. The next revision of HTTP/1.1 will not define a default. See RFC 6657 for further information.
