multipart/form-data: what is the default charset for fields?
What is the default encoding one should use to decode multipart/form-data if no charset is given? RFC2388 states:
4.5 Charset of text in form data
Each part of a multipart/form-data is supposed to have a content-type. In the case where a field element is text, the charset parameter for the text indicates the character encoding used.
For example, a form with a text field in which a user typed 'Joe owes <eu>100' where <eu> is the Euro symbol might have form data returned as:
    --AaB03x
    content-disposition: form-data; name="field1"
    content-type: text/plain;charset=windows-1250
    content-transfer-encoding: quoted-printable

    Joe owes =80100.
    --AaB03x
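For comparison, here is a minimal Python sketch of how the RFC's example part would be decoded when the charset is declared: undo the quoted-printable transfer encoding first, then decode the bytes with the declared charset. The helper name `decode_part` is hypothetical, and the body bytes are copied from the RFC example above.

```python
import quopri

# Body and charset taken from the RFC 2388 example part.
raw_body = b"Joe owes =80100."
declared_charset = "windows-1250"


def decode_part(body, charset, transfer_encoding=""):
    """Hypothetical helper: undo the transfer encoding, then the charset."""
    if transfer_encoding.lower() == "quoted-printable":
        # "=80" becomes the single byte 0x80.
        body = quopri.decodestring(body)
    # 0x80 is the Euro sign in windows-1250.
    return body.decode(charset)


text = decode_part(raw_body, declared_charset, "quoted-printable")
print(text)
```

Without a declared charset, the second step has nothing to go on, which is exactly the gap this question is about.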
In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.
Thank you!
This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).
So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.
If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?
I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.
As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".
This is my understanding of the state of things. I welcome any corrections to my assumptions!
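A rough sketch of what that server-side lookup could look like in Python. It assumes the multipart body has already been split into a dict of field name to raw bytes. The function names and the UTF-8 fallback are my assumptions, not something a spec or framework prescribes.

```python
import codecs


def pick_charset(raw_fields, fallback="utf-8"):
    """Return the codec named by the _charset_ field, else the fallback."""
    declared = raw_fields.get("_charset_", b"").decode("ascii", "replace").strip()
    if declared:
        try:
            # Normalise the label, e.g. "windows-1250" -> "cp1250".
            return codecs.lookup(declared).name
        except LookupError:
            pass  # unknown label: ignore it and use the fallback
    return fallback


def decode_fields(raw_fields, fallback="utf-8"):
    """Decode every field except _charset_ with the chosen charset."""
    charset = pick_charset(raw_fields, fallback)
    return {
        name: value.decode(charset, errors="replace")
        for name, value in raw_fields.items()
        if name != "_charset_"
    }


form = {"_charset_": b"windows-1250", "field1": b"Joe owes \x80100."}
print(decode_fields(form))
```

If the form carries no `_charset_` field, everything hinges on the fallback, so it should match what your clients actually send.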
The default charset for HTTP 1.1 is ISO-8859-1 (Latin-1); I would guess that this also applies here.
--snip--
Thanks to @owlman for the detailed explanation.
Just some more info here:
Upload request payload fragment:
If "xxx.txt" contains Unicode characters encoded as UTF-8, Resin (as of 4.0.40) can't decode it correctly, but Jetty (9.x) can.
I think the reason for Resin's behavior is that the Content-Type doesn't specify any encoding, so Resin decodes the file name as "ISO8859-1", which may result in garbled characters.
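To illustrate the garbling, here is a small Python sketch (not Resin's actual code) of what happens when UTF-8 bytes are decoded as ISO-8859-1, and how the text can be recovered afterwards. The file name is a made-up example.

```python
# Hypothetical file name containing non-ASCII characters.
filename = "r\u00e9sum\u00e9.txt"  # "résumé.txt"

# The browser sends the name as UTF-8 bytes without declaring a charset.
wire_bytes = filename.encode("utf-8")

# A server that assumes ISO-8859-1 maps each byte to one character,
# so every two-byte UTF-8 sequence turns into two wrong characters.
garbled = wire_bytes.decode("iso-8859-1")
print(garbled)

# ISO-8859-1 round-trips every byte, so the original bytes can be
# recovered and decoded again with the right charset.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered)
```

The recovery step only works because ISO-8859-1 maps all 256 byte values. It is a workaround for when you know the client really sent UTF-8, not a general fix.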
I did some googling:
https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%[email protected]%3E
It seems that Resin's behavior follows Servlet Spec 2.3, and I can't find any setting in http://www.caucho.com/resin-4.0/reference.xtp that changes this behavior for Resin.