Is it safe to assume that a decoded percent-encoded URI is in UTF-8?

Posted on 2024-12-09 03:07:35

RFC 3986 states that new URI schemes should encode characters as UTF-8 first, before percent-encoding them. However, this does not apply to previous URI versions.

Is it safe to assume that every multibyte, percent-encoded URI turns into a UTF-8 encoded string after being passed through urldecode()?

For example, suppose the contents of $_SERVER['REQUEST_URI'] are percent-encoded like this:

/b%C3%BCch/w%C3%B6rterb%C3%BCch

After I pass this string to urldecode(), I should have a multibyte string. But how do I know which encoding the string is in? In the above example it's UTF-8, but is it safe to always assume so?

If it's not safe to assume so, is there a way (other than mb_detect_encoding) to detect the encoding of the string? I've checked the request headers, and they don't seem to contain anything helpful.

守望孤独 2024-12-16 03:07:35

Thank you for all the comments and answers! I have done some digging myself after I posted the question and would like to write it down here as a reference. Please let me know if this answer is wrong.

Skip to the end to go directly to the conclusion.

In the Jetty docs on International Characters and Character Encoding,
in the section "International characters in URLs", I found these
paragraphs:

Due to the lack of a standard, different browers took different approaches to the character encoding used. Some use the encoding of the page and some use UTF-8. Some drafts were prepared by various standards bodies suggesting that UTF-8 would become the standard encoding. Older versions of jetty (eg 4.0.x series) used UTF-8 as the default in anticipation of a standard being adopted. As a standard was not forthcoming, jetty-4.1.x reverted to a default encoding of ISO-8859-1.
The W3C organization's HTML standard now recommends the use of UTF-8: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars and accordingly jetty-6 series uses a default of UTF-8.

In the linked HTML 4.0 spec, there is indeed a recommendation
for clients to encode non-ASCII characters as UTF-8 first, before
percent-encoding them, so we know this has been a W3C
recommendation since HTML 4.0.

The example used on the page is this:

<A href="http://foo.org/Håkon">...</A>

While it later states that the same encoding should be applied to
the fragment part, it doesn't say whether it also applies to the
query string.

Typing URLs into browsers

Firefox

As Pekka already mentioned, based on this link Firefox was still
sending ISO-8859-1 encoded URIs as late as 2007. Reading the link,
this seems to be the default behavior for Firefox < 3.0. I'm
not sure whether this also applies to Firefox < 3.0 on Mac OS X,
since the default encoding on Mac is UTF-8.

I've tested Firefox 3.6.13 in Windows XP and Firefox 6 in both
Windows 7 and Mac OS X. The Mac version sends everything in
UTF-8, so it's nothing to worry about.

Firefox 3.6.13 and 6 on Windows encode query strings as ISO-8859-1
by default, but when you type characters that don't exist in
ISO-8859-1 into the query string (α, for example), Firefox 3
switches the encoding of the entire query string to UTF-8. I'm
pretty sure later versions behave the same way.

In Firefox 3.6.13 and 6 in Windows that I tested, the path part of
the URI is always encoded as UTF-8.

If you type this URL into Firefox 3.6/6 on Windows:

http://localhost/test/ü/ä/index.php?chär=ü

The query string gets encoded as ISO-8859-1, but the 'path' part
gets encoded as UTF-8:

http://localhost/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC
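
The mixed result above can be reproduced on the receiving side. The following is a minimal Python sketch, not PHP, that splits the request target from the example into path and query and decodes each part separately; the variable names are only illustrative.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Request target from the example above (scheme and host omitted).
raw = "/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC"

parts = urlsplit(raw)

# Percent-decoding yields raw bytes; no character encoding is implied yet.
path_bytes = unquote_to_bytes(parts.path)
query_bytes = unquote_to_bytes(parts.query)

# The path bytes form valid UTF-8.
path_text = path_bytes.decode("utf-8")

# The query bytes do not: 0xE4 and 0xFC are single ISO-8859-1 bytes.
try:
    query_bytes.decode("utf-8")
    query_is_utf8 = True
except UnicodeDecodeError:
    query_is_utf8 = False

query_text = query_bytes.decode("iso-8859-1")

print(path_text, query_is_utf8, query_text)
```

The point of the sketch is that one and the same URI can carry two encodings, so the path and the query string have to be checked independently.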

It is also worth noting that, according to this blog post, Firefox 3.0
converts the katakana character ア into ア before percent-encoding
it. When I tried this in Firefox 3.6.13, in both the query string
and the path, the katakana character was encoded as UTF-8 correctly.

Opera

Opera 10.10 on Mac encodes the query string part of the URI into
ISO-8859-1, even though the default encoding for Mac OS X is
UTF-8. The 'path' part gets encoded into UTF-8, just like Firefox.

If you type the Greek letter α into the query string, it gets
sent as a question mark.

The same behavior is exhibited by Opera 11.51 in Windows XP.

Safari

Safari 5.1 on Mac always sends everything as UTF-8.
Safari 5.1 on Windows exhibits the same behavior.

Chrome

Version 13 on Windows encodes both query string and path as
UTF-8. I don't have Chrome on Mac, but it seems safe to assume
that Chrome always sends UTF-8, like Safari.

Internet Explorer

DISCLAIMER: I use IECollection to install multiple versions of IE
on one machine, so this may not be IE's natural behavior
(can anyone confirm this?).

IE 6, 7, and 8 on Windows XP encode the 'path' part of the URI as
UTF-8 correctly. Umlauts and Greek letters typed into the query
string do not get percent-encoded, though. The query string typed
into the address bar seems to be sent as ISO-8859-1, and the Greek
letter alpha 'α' in the query string gets transliterated into 'a'.

Conclusion

This is short and incomplete, and I cannot guarantee its
correctness, but it seems that the most common encodings
for URIs are ISO-8859-1 and UTF-8 (I have no idea which encodings
are used in East Asia, and it would take too much effort for me
to try and find out).

Since it has been a recommendation since HTML 4.0, I guess it's
safe to assume the 'path' part of the URI is always encoded in
UTF-8. Firefox 2.0 might still be around, so you should also check
whether the encoding is ISO-8859-1. If it's neither UTF-8 nor
ISO-8859-1, it is most likely a bad request.
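
The "UTF-8 first, then ISO-8859-1" check could be sketched in Python roughly as follows. The function name decode_component is hypothetical, and this is only one possible way to express the idea.

```python
from typing import Tuple
from urllib.parse import unquote_to_bytes


def decode_component(component: str) -> Tuple[str, str]:
    """Hypothetical helper: percent-decode one URI component and
    return (text, encoding_used), trying strict UTF-8 first."""
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fallback for older clients that send ISO-8859-1.
        return raw.decode("iso-8859-1"), "iso-8859-1"


print(decode_component("b%C3%BCch"))
print(decode_component("b%FCch"))
```

Note that in Python every byte value maps to a character in the iso-8859-1 codec, so the fallback never fails. It therefore cannot tell a genuine ISO-8859-1 request apart from one sent in some third encoding.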

It's theoretically impossible to reliably detect the encoding
of a string (see here, and here). You can guess, but
you can get the wrong result. So don't rely on encoding detection.
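
A small Python illustration of why detection is only a guess: the two bytes behind %C3%BC are valid in both encodings discussed here, and nothing in the bytes themselves says which reading was intended.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%C3%BC")  # the bytes 0xC3 0xBC

# Both decodes succeed; they just produce different text.
as_utf8 = raw.decode("utf-8")          # one character: ü
as_latin1 = raw.decode("iso-8859-1")   # two characters: Ã followed by ¼

print(len(as_utf8), len(as_latin1))
```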

Safe Multibyte Routing

The safest way is just to choose one encoding (UTF-8 is the
safest bet) for your entire application. Then you have to:

Make sure that all your strings are encoded in UTF-8 before
using them to build your URIs. Properly percent-encode your URIs
after that.
Make sure all your URL-encoded (GET) forms send their data in
the proper encoding. See this FAQ by Kore Nordmann for
more information about making sure your forms send the correct
encoding.
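
As an illustration of the first point, here is a Python sketch that builds a URI from text by encoding it as UTF-8 and then percent-encoding it. The segment and parameter values are taken from the examples earlier on this page; the variable names are arbitrary.

```python
from urllib.parse import quote, urlencode

segments = ["büch", "wörterbüch"]

# Percent-encode each path segment from its UTF-8 bytes; safe="" makes
# sure a "/" inside a segment would be escaped as well.
path = "/" + "/".join(quote(s, safe="", encoding="utf-8") for s in segments)

# Query parameters are encoded the same way, key and value alike.
query = urlencode({"chär": "ü"}, encoding="utf-8")

uri = path + "?" + query
print(uri)
```

The resulting path is the same percent-encoded form as in the question, which is what a UTF-8-only application would then expect to receive back.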

Also see this great answer from bobince.

After this, you shouldn't have any problems parsing the URI. If
the encoding is not UTF-8, then it's a bad request, and you
can respond with a 404 or 400 page.
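
A rough Python sketch of that final check, assuming the application has settled on UTF-8 only. The function name status_for is hypothetical, and a real application would route the request instead of returning 200.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def status_for(request_uri: str) -> int:
    """Hypothetical helper: return 400 if the path or the query string
    is not valid UTF-8 after percent-decoding, otherwise 200."""
    parts = urlsplit(request_uri)
    for component in (parts.path, parts.query):
        try:
            # Strict decode: raises on any byte sequence that is not UTF-8.
            unquote_to_bytes(component).decode("utf-8")
        except UnicodeDecodeError:
            return 400
    return 200


print(status_for("/b%C3%BCch/w%C3%B6rterb%C3%BCch"))
print(status_for("/test/%C3%BC/index.php?ch%E4r=%FC"))
```

Validating with a strict decode, rather than guessing the encoding, keeps the rule simple: either the request is UTF-8 or it is rejected.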
