multipart/form-data: what is the default charset for fields?

Posted 2024-09-30 13:45:41

What is the default encoding one should use to decode multipart/form-data if no charset is given? RFC 2388 states:

4.5 Charset of text in form data

Each part of a multipart/form-data is supposed to have a content-
type. In the case where a field element is text, the charset
parameter for the text indicates the character encoding used.

For example, a form with a text field in which a user typed 'Joe owes
<eu>100' where <eu> is the Euro symbol might have form data returned
as:

--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable

Joe owes =80100.
--AaB03x
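For reference, when the charset parameter is present, decoding the RFC example is mechanical. Below is a minimal Python sketch, assuming the part body and its header values have already been extracted from the request (the variable names are illustrative and not taken from the RFC):

```python
import quopri

# Values copied from the RFC 2388 example part above.
charset = "windows-1250"
body = b"Joe owes =80100."

# Undo the quoted-printable transfer encoding first, then apply the charset.
raw = quopri.decodestring(body)
text = raw.decode(charset)
print(text)
```

The open question is what to pass as `charset` when the part does not declare one.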

In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.

Thank you!

Comments (3)

你与昨日 2024-10-07 13:45:41

This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).

The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.

So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.

If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?
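A rough sketch of what such server-side code could look like, in Python. This is an assumption-laden illustration, not how any particular server does it: `decode_form_fields` is a hypothetical helper, the parsing is deliberately naive (CRLF line endings, ASCII field names, file parts skipped), and the UTF-8 fallback is my own choice rather than something the spec mandates.

```python
import codecs
import re


def decode_form_fields(body: bytes, boundary: str, fallback: str = "utf-8") -> dict:
    """Decode non-file fields, honouring a `_charset_` entry if one is present."""
    delimiter = b"--" + boundary.encode("ascii")
    raw_fields = {}
    for chunk in body.split(delimiter):
        # Each real part is "headers CRLF CRLF value"; anything else is skipped.
        head, separator, value = chunk.partition(b"\r\n\r\n")
        if not separator or b"filename=" in head:
            continue  # preamble, closing marker, or a file upload
        match = re.search(rb'name="([^"]*)"', head)
        if match is None:
            continue
        if value.endswith(b"\r\n"):
            value = value[:-2]  # drop the CRLF that precedes the next delimiter
        raw_fields[match.group(1).decode("ascii", "replace")] = value

    # The browser fills `_charset_` with the encoding it used, if the field exists.
    charset = raw_fields.get("_charset_", b"").decode("ascii", "ignore").strip()
    try:
        codecs.lookup(charset)
    except (LookupError, ValueError):
        charset = fallback  # assumption: fall back when missing or unknown
    return {name: value.decode(charset, "replace") for name, value in raw_fields.items()}
```

For example, a body carrying `_charset_` set to `windows-1252` and a field containing the byte 0x80 would decode that byte as the Euro sign, whereas the same body without `_charset_` would be decoded with the fallback.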

I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.

Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.

Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.

As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".

This is my understanding of the state of things. I welcome any corrections to my assumptions!

〆凄凉。 2024-10-07 13:45:41

The default charset for HTTP 1.1 is ISO-8859-1 (Latin-1). I would guess that this also applies here.

3.7.1 Canonicalization and Text Defaults

--snip--

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
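One practical caveat with assuming ISO-8859-1: decoding with it never fails, so UTF-8 input is silently turned into garbage instead of raising an error. A common workaround, which is a heuristic and not something the quoted text prescribes, is to try strict UTF-8 first and fall back to ISO-8859-1 only when that fails. A minimal Python sketch (`decode_unlabelled` is a hypothetical name):

```python
def decode_unlabelled(raw: bytes) -> str:
    """Decode a text part that carries no charset parameter (heuristic)."""
    try:
        # Strict UTF-8 rejects most byte sequences that are really Latin-1.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")


# What goes wrong when UTF-8 bytes are decoded as ISO-8859-1 unconditionally:
mojibake = "é".encode("utf-8").decode("iso-8859-1")
print(mojibake)
```

The heuristic can still guess wrong for short Latin-1 strings that happen to be valid UTF-8, so an explicit charset, where available, should take priority.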

秋心╮凉 2024-10-07 13:45:41

Thanks to @owlman for the detailed explanation.

Just some more info here:

Upload request payload fragment:

------WebKitFormBoundarydZAwJIasnBbGaUqM
Content-Disposition: form-data; name="file"; filename="xxx.txt"
Content-Type: text/plain

If the file name "xxx.txt" contains Unicode characters encoded as UTF-8, Resin (as of 4.0.40) can't decode it correctly, but Jetty (9.x) can.

I think the reason for Resin's behavior is that the Content-Type doesn't specify any encoding, so Resin decodes the file name as ISO-8859-1, which may result in garbled characters.
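If the garbling really comes from UTF-8 bytes being decoded as ISO-8859-1, it is reversible, because ISO-8859-1 maps every byte to exactly one code point. A small Python sketch of the round trip, under that assumption (`repair_latin1_mojibake` is a hypothetical helper, and the file name is made up for illustration):

```python
def repair_latin1_mojibake(garbled: str) -> str:
    """Undo a UTF-8 string that was wrongly decoded as ISO-8859-1."""
    # Re-encoding as ISO-8859-1 recovers the original bytes exactly.
    return garbled.encode("iso-8859-1").decode("utf-8")


original = "résumé.txt"
# Simulate what a server does when it assumes ISO-8859-1 for UTF-8 bytes.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(garbled)
print(repair_latin1_mojibake(garbled))
```

This only works if the wrong decoding was ISO-8859-1 and nothing lossy happened in between; it raises an error if the string was not produced that way.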

I did some googling:

https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%[email protected]%3E

It seems that Resin's behavior follows Servlet Spec 2.3.

I also can't find any setting in http://www.caucho.com/resin-4.0/reference.xtp that changes this behavior in Resin.
