How do web servers know the charset used in a form posted to them?

Posted on 2024-11-16 04:36:24

When a web server gets a POST of a form, parsing it into param-value pairs is quite straightforward. However, if the values contain non-English chars that have been encoded by the browser, the server must know the charset used in order to decode them.

I've examined the requests sent by two posts. One was made from a page using UTF-8, and the other from a page using Windows-1255. The same text was encoded differently. AFAIK, the Content-Type header could contain a charset after application/x-www-form-urlencoded, but it didn't (using Firefox).
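
To make the difference concrete, here is a minimal Java sketch of how the same form value ends up as different bytes on the wire depending on the charset used before percent-encoding. It is only an illustration: the class name `FormEncodingDemo` and the sample value are made up, and ISO-8859-1 stands in for Windows-1255 because every JVM is guaranteed to support it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncodingDemo {
    // Percent-encodes a value in application/x-www-form-urlencoded style,
    // converting characters to bytes with the given charset first.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00E9"; // "cafe" with an acute accent on the e
        System.out.println(encode(value, "UTF-8"));      // caf%C3%A9
        System.out.println(encode(value, "ISO-8859-1")); // caf%E9
    }
}
```

Nothing in the encoded body itself says which of the two charsets was used, which is exactly the problem.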

In a servlet, when you use request.getParameter(), you're supposed to get the decoded value. How does the servlet container do that? Does it always bet on UTF-8, use some heuristics, or is there some deterministic way I'm missing?

Comments (2)

感受沵的脚步 2024-11-23 04:36:24

From the Servlet 3.0 Spec, section 3.10 Request Data Encoding (emphasis mine):

Currently, many browsers do not send a char encoding qualifier with the ContentType header, leaving open the determination of the character encoding for reading HTTP requests. The default encoding of a request the container uses to create the request reader and parse POST data must be “ISO-8859-1” if none has been specified by the client request. However, in order to indicate to the developer, in this case, the failure of the client to send a character encoding, the container returns null from the getCharacterEncoding method.

If the client hasn’t set character encoding and the request data is encoded with a different encoding than the default as described above, breakage can occur. To remedy this situation, a new method setCharacterEncoding(String enc) has been added to the ServletRequest interface. Developers can override the character encoding supplied by the container by calling this method. It must be called prior to parsing any post data or reading any input from the request. Calling this method once data has been read will not affect the encoding.
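
The rule quoted above can be sketched in plain Java. This is a simplified model, not any real container's code: `RequestCharsetSketch`, `charsetFromContentType` and `decodeParameter` are hypothetical names, and the `override` argument only imitates the effect of calling setCharacterEncoding before the parameters are parsed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;

class RequestCharsetSketch {
    // Default required by the quoted spec text when the client names no charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type header value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type itself
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Decodes one raw form value: explicit override first, then the header, then the default.
    static String decodeParameter(String raw, String contentType, String override) {
        String charset = (override != null) ? override : charsetFromContentType(contentType);
        if (charset == null) {
            charset = DEFAULT_CHARSET;
        }
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String contentType = "application/x-www-form-urlencoded"; // no charset, as in the question
        String raw = "caf%C3%A9"; // the browser encoded the value as UTF-8
        System.out.println(decodeParameter(raw, contentType, null));    // garbled: decoded as ISO-8859-1
        System.out.println(decodeParameter(raw, contentType, "UTF-8")); // decoded as intended
    }
}
```

With no charset in the header and no override, the UTF-8 bytes are read as ISO-8859-1 and each byte becomes its own character, which is the breakage the spec describes.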

In practice, I find that setting the charset in a response influences the charset used in the subsequent POST, since browsers generally submit a form in the encoding of the page that contains it. To be extra sure, you can write a Servlet Filter that calls setCharacterEncoding on every request object before it is used.

You may also find this thread useful - Detecting the character encoding of an HTTP POST request

手心的海 2024-11-23 04:36:24

The appropriate header for specifying charsets is Accept-Charset.

The latest Chrome for Linux, for example, sends the following on each request:
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3

Section 14.2 from http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html states:

The Accept-Charset request-header field can be used to indicate what character sets are acceptable for the response. This field allows clients capable of understanding more comprehensive or special-purpose character sets to signal that capability to a server which is capable of representing documents in those character sets.

(...)

If no Accept-Charset header is present, the default is that any character set is acceptable. If an Accept-Charset header is present, and if the server cannot send a response which is acceptable according to the Accept-Charset header, then the server SHOULD send an error response with the 406 (not acceptable) status code, though the sending of an unacceptable response is also allowed.

So if you receive such a header from a client, the value with the highest q may be the encoding you're receiving from it, though the header formally describes only what the client accepts in the response.
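
A rough Java sketch of picking the highest-q entry from such a header follows. The class and method names are made up for illustration, the parsing is deliberately simplified, and since the quoted RFC text describes Accept-Charset as a statement about acceptable responses, the result is at best a hint about the request body.

```java
import java.util.Locale;

class AcceptCharsetSketch {
    // Returns the entry with the highest q value (q defaults to 1.0 when omitted).
    // Ties go to the entry listed first; returns null if the header has no entries.
    static String preferredCharset(String header) {
        if (header == null) {
            return null;
        }
        String best = null;
        double bestQ = -1.0;
        for (String item : header.split(",")) {
            String[] parts = item.trim().split(";");
            String name = parts[0].trim();
            if (name.isEmpty()) {
                continue;
            }
            double q = 1.0;
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim().toLowerCase(Locale.ROOT);
                if (param.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(param.substring(2).trim());
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed weight as least preferred
                    }
                }
            }
            if (q > bestQ) {
                bestQ = q;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Header value taken from the Chrome example above.
        System.out.println(preferredCharset("ISO-8859-1,utf-8;q=0.7,*;q=0.3")); // ISO-8859-1
    }
}
```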
