What is the proper way to URL encode Unicode characters?

Posted 2024-07-22 05:57:52

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.

Some interesting examples:

The heart character.
If I type this into my browser:

http://www.google.com/search?q=♥

Then copy and paste it, I see this URL

http://www.google.com/search?q=%E2%99%A5

which makes it seem like Firefox (or Safari) is doing this.

urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'

which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.

If I type the URL

http://www.google.com/search?q=…

into my browser then copy and paste, I get

http://www.google.com/search?q=%E2%80%A6

back. Which seems to be the result of doing

urllib.quote_plus(x.encode("utf-8"))

which makes sense since … can't be encoded with Latin-1.

But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1.

Since this seems to be ambiguous:

In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'

works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.

What's the right thing to be doing with the special characters I need to deal with?


Comments (5)

回忆躺在深渊里 2024-07-29 05:57:52

I would always encode in UTF-8. From the Wikipedia page on percent encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

It seems like because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.
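As a minimal sketch of that advice, assuming Python 3 (where the question's `urllib.quote_plus` lives at `urllib.parse.quote_plus`): the helper name `encode_query_value` is made up for illustration. It turns the text into UTF-8 bytes and percent-encodes them, which reproduces the two URLs the browser showed in the question.

```python
from urllib.parse import quote_plus

HEART = "\u2665"     # the heart character from the question
ELLIPSIS = "\u2026"  # the triple dot character from the question


def encode_query_value(text: str) -> str:
    """Hypothetical helper: UTF-8 encode, then percent-encode a query value."""
    return quote_plus(text, encoding="utf-8")


print(encode_query_value(HEART))     # %E2%99%A5
print(encode_query_value(ELLIPSIS))  # %E2%80%A6

# Neither character exists in Latin-1, so Latin-1 could not have produced
# the three-byte sequences above.
try:
    HEART.encode("latin-1")
except UnicodeEncodeError:
    print("the heart cannot be encoded as Latin-1")
```

UTF-8 is already the default for `quote_plus` in Python 3; it is spelled out here only to make the choice visible.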

漫雪独思 2024-07-29 05:57:52

IRI (RFC 3987) is the latest standard that replaces the URI/URL (RFC 3986 and older) standards. URI/URL do not natively support Unicode (well, RFC 3986 adds provisions for future URI/URL-based protocols to support it, but does not update past RFCs). The "%uXXXX" scheme is a non-standard extension that allows Unicode in some situations, but it is not universally implemented. IRI, on the other hand, fully supports Unicode, and requires that text be encoded as UTF-8 before being percent-encoded.
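To illustrate the point about "%uXXXX" not being universally implemented, here is a small sketch using Python 3's standard library as one example of a decoder that only understands the standard `%XX` form. The variable names are illustrative only.

```python
from urllib.parse import unquote

# Standard form: percent-encoded UTF-8 bytes.
standard = unquote("%E2%99%A5")

# Non-standard form: "u2" is not a pair of hex digits, so the standard
# library leaves the whole sequence untouched instead of decoding it.
nonstandard = unquote("%u2665")

print(standard == "\u2665")  # True
print(nonstandard)           # %u2665
```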

乖乖 2024-07-29 05:57:52

The general rule seems to be that browsers encode form responses according to the content-type of the page the form was served from. The assumption is that if the server sends us "text/xml; charset=iso-8859-1", then it expects responses back in the same encoding.

If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).

The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.

It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded in. For instance, in Tomcat, you have to call request.setCharacterEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks.)
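The ambiguity described above can be sketched in Python 3 (an assumption, since the question uses Python 2). Percent-decoding only yields bytes; which characters those bytes stand for depends entirely on the charset the receiver picks. The helper name `decode_query_value` is hypothetical.

```python
from urllib.parse import unquote_plus, unquote_to_bytes

# Percent-decoding by itself produces bytes, not characters.
raw = unquote_to_bytes("%E2%80%A6")

as_utf8 = raw.decode("utf-8")      # one character: the ellipsis
as_latin1 = raw.decode("latin-1")  # three characters; Latin-1 never fails


def decode_query_value(value: str, charset: str = "utf-8") -> str:
    """Hypothetical helper: decode with an explicit charset, failing loudly."""
    return unquote_plus(value, encoding=charset, errors="strict")


print(len(as_utf8), len(as_latin1))              # 1 3
print(decode_query_value("%E9", "latin-1"))      # the single byte E9 is é in Latin-1
try:
    decode_query_value("%E9", "utf-8")           # E9 alone is not valid UTF-8
except UnicodeDecodeError:
    print("not valid UTF-8")
```

Because every byte sequence is valid Latin-1, a successful Latin-1 decode proves nothing, whereas a strict UTF-8 decode at least rejects input that was encoded some other way.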

勿忘初心 2024-07-29 05:57:52

IRIs do not replace URIs, because only URIs (effectively, ASCII) are permissible in some contexts -- including HTTP.

Instead, you specify an IRI and it gets transformed into a URI when going out on the wire.
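A rough sketch of that IRI-to-URI transformation, assuming Python 3. The function `iri_to_uri` is a hypothetical helper and deliberately simplified (it ignores userinfo and IPv6 literals); it is not a complete implementation of the mapping in RFC 3987.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI to an ASCII-only URI (simplified)."""
    parts = urlsplit(iri)

    # Host names are converted with IDNA (Punycode), not percent-encoding.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"

    # Everything else: UTF-8 bytes, then percent-encode. "%" is kept safe so
    # escapes that are already present are not encoded a second time.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=")
    query = quote(parts.query, safe="/?%:@!$&'()*+,;=")
    fragment = quote(parts.fragment, safe="/?%:@!$&'()*+,;=")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://www.google.com/search?q=\u2665"))
# http://www.google.com/search?q=%E2%99%A5
```

The result contains only ASCII, which is what goes out on the wire in an HTTP request line.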

一城柳絮吹成雪 2024-07-29 05:57:52

The first question is: what are your needs? UTF-8 encoding is a pretty good compromise between taking text created with a cheap editor and supporting a wide variety of languages. As for the browser identifying the encoding, the response (from the web server) should tell the browser the encoding. Still, most browsers will attempt to guess, because this information is either missing or wrong in so many cases. They guess by reading some amount of the result stream to see if there is a character that does not fit in the default encoding. Currently all browsers (I did not check this, but it is pretty close to true) use UTF-8 as the default.

So use UTF-8 unless you have a compelling reason to use one of the many other encoding schemes.
