How do you think Google handles this encoding issue?
I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It looks like Firefox's default character encoding for URLs is NOT UTF-8, unlike most browsers, which do use UTF-8. Additionally, it looks like Firefox tries to make an intelligent decision about which character encoding to use, based on the content of the URL.
For example, if you enter a URL with a 'q' parameter directly into the address bar (I'm using Firefox 3.5.5), this is how the query string parameter is actually encoded in the HTTP request:
1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded. Strictly speaking, byte 0x9E is ž in windows-1252; ISO-8859-1 itself has no ž.)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> q=Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded, which is odd, because the first part of the value is the same as in 1, where it was sent in a single-byte encoding.)
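The three results above are consistent with a "single-byte encoding if every character fits, otherwise UTF-8" rule. Here is a minimal Python sketch of that guess. It assumes windows-1252 (`cp1252`) as the single-byte encoding, since that is where 0x9E maps to ž, and the helper name `encode_like_address_bar` is hypothetical. It only reproduces the observed output and is not Firefox's actual implementation.

```python
from urllib.parse import quote


def encode_like_address_bar(value: str) -> str:
    """Hypothetical helper: percent-encode `value` the way the three
    observations suggest (an assumption, not Firefox's real code)."""
    try:
        # Use the single-byte encoding when every character fits in it.
        return quote(value, safe="", encoding="cp1252", errors="strict")
    except UnicodeEncodeError:
        # At least one character does not fit, so encode everything as UTF-8.
        return quote(value, safe="", encoding="utf-8")


for term in ("Knji\u017eevni", "\u6f22\u5b57", "Knji\u017eevni\u6f22\u5b57"):
    print(term, "-->", encode_like_address_bar(term))
```

Under this assumption the mixed value in case 3 comes out entirely in UTF-8, because 漢字 cannot be represented in the single-byte encoding.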
So, this really shouldn't be a big deal, right? Well, for me it is not a huge one, but it does matter. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (as in the example above, the parameter that holds the query string value) is submitted on the request UTF-8 encoded, and all is well and good.
However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is encoded in a single-byte Latin encoding rather than UTF-8 and gets sent to our server as "q=Knji%9Eevni". The problem with this is that we always expect a UTF-8 encoded URL, so when we receive this parameter our application does not know how to interpret it, and it can cause some strange results.
As I mentioned before, this appears to be ONLY a Firefox issue, and it would be rare for a user to actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google handles this quite nicely. Entering either of the following URLs, which use the two differently encoded forms of the query string parameter, returns good results in Google:
http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni
So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?
Comments (2)
It looks like Firefox uses latin-1 unless some character can't be represented in that encoding, in which case it uses UTF-8.
If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).
Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
Still, the possibility exists and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise - like for compatibility with servers that wouldn't know UTF-8 if they stepped in it.
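The validate-then-fall-back approach described above can be sketched in Python as follows. The function name `decode_query_value` is hypothetical. As an assumption beyond what was said above, the fallback tries windows-1252 (`cp1252`) before plain latin-1, because byte 0x9E only decodes to ž in windows-1252; latin-1 is kept as a last resort since it accepts every byte.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw: str) -> str:
    """Hypothetical helper: decode a percent-encoded query value.

    Note: this does not treat '+' as a space; handle that separately
    if the value comes from a form submission.
    """
    data = unquote_to_bytes(raw)  # undo the %XX escaping, keep raw bytes
    try:
        # Strict decoding doubles as validation: invalid UTF-8 raises.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        # Assumption: the single-byte fallback is windows-1252.
        return data.decode("cp1252")
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined; latin-1 never fails.
        return data.decode("latin-1")


print(decode_query_value("Knji%C5%BEevni"))
print(decode_query_value("Knji%9Eevni"))
```

Both calls should yield the same search term, so the server can accept either form of the URL.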
There are several parts in a url. The domain name is encoded according to the IDN (International Domain Names) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).
The part that you care about comes (usually) from a form, and the encoding of the source page determines the encoding used (before the % escaping). The form element in HTML can also take an accept-charset attribute, which overrides the page setting.
So it is not the fault of Firefox; the encoding of the referrer page/form is the determining factor. That is the standard behavior.