How do you think Google handles this encoding issue?

Posted on 2024-08-12 05:16:39

I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It looks like Firefox's default character encoding for URLs is NOT UTF-8, whereas most other browsers do use UTF-8. Additionally, Firefox seems to be trying to make an intelligent decision about which character encoding to use, based on the content of the URL.

For example, if you enter a URL directly into the address bar (I'm using Firefox 3.5.5) with a 'q' parameter, you will get the following results:

For each query string parameter below, this is how it is actually encoded in the HTTP request:
1) ...q=Književni --> q=Knji%9Eevni (this appears to be ISO-8859-1 encoded; strictly speaking, byte 0x9E is ž in Windows-1252, the superset that browsers commonly treat ISO-8859-1 as)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (this appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> q=Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (this appears to be UTF-8 encoded, which is odd, because the first part of the value is the same text as in 1, which was sent in the single-byte encoding)
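To make the three cases concrete, here is a minimal Python sketch of a rule that reproduces them: use a single-byte encoding when every character fits, otherwise fall back to UTF-8 for the whole value. This is only a model that matches the observed output, not Firefox's actual code. The function name `address_bar_encode` is made up for illustration, and Python's `cp1252` codec stands in for the single-byte encoding because it maps ž to byte 0x9E.

```python
from urllib.parse import quote


def address_bar_encode(value: str) -> str:
    """Hypothetical model of the observed behaviour, not Firefox's real logic."""
    try:
        # Assumption: a Windows-1252 style single-byte encoding is tried first.
        return quote(value, safe="", encoding="cp1252", errors="strict")
    except UnicodeEncodeError:
        # At least one character does not fit, so the whole value goes out as UTF-8.
        return quote(value, safe="", encoding="utf-8")


for value in ("Književni", "漢字", "Književni漢字"):
    print(f"q={value} --> q={address_bar_encode(value)}")
```

Under this model the encoding is chosen once for the whole value, which would explain why "Književni" comes out differently in cases 1 and 3.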

So, this really shouldn't be a big deal, right? Well, for me it mostly isn't, but it does matter a little. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (as in the example above, the parameter that holds the search term) is sent on the request UTF-8 encoded, and all is well and good.

However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is encoded as ISO-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem is that we always expect a UTF-8 encoded URL, so when we receive this parameter our application does not know how to interpret it, and the search term can come out garbled.
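The failure is easy to reproduce outside our application. The sketch below uses Python's standard `urllib.parse.unquote` purely as a stand-in for whatever a server framework does when it assumes UTF-8. The variable names are illustrative.

```python
from urllib.parse import unquote

raw = "Knji%9Eevni"  # what the server receives after ENTER in the address bar

# Lenient decoding: the lone byte 0x9E is not valid UTF-8, so it becomes U+FFFD.
lenient = unquote(raw, encoding="utf-8", errors="replace")
print(lenient)

# Strict decoding: the same byte raises instead of silently losing the character.
strict_failed = False
try:
    unquote(raw, encoding="utf-8", errors="strict")
except UnicodeDecodeError as exc:
    strict_failed = True
    print("not valid UTF-8:", exc.reason)
```

Either way the ž is lost unless the server notices that the bytes are not UTF-8 and tries something else.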

As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare for a user to actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google handles this quite nicely. Entering either of the following URLs, which carry the two differently encoded forms of the query string parameter, returns good results from Google:

http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni

So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?
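One guess, and it is only a guess since nothing here confirms what Google does: decode the raw bytes as UTF-8 first, and only if that fails retry with a single-byte fallback. A minimal Python sketch of that idea follows. The function name `decode_query_value` and the choice of `cp1252` as the fallback are assumptions for illustration.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw: str, fallback: str = "cp1252") -> str:
    """Decode a percent-encoded query value, preferring UTF-8.

    Hypothetical helper: the fallback encoding is an assumption,
    not a documented behaviour of any particular server.
    """
    # In query strings "+" means space; convert it before unescaping
    # so that a literal plus sent as %2B survives.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that are not valid UTF-8 are assumed to be single-byte text.
        return data.decode(fallback, errors="replace")


for raw in ("Knji%C5%BEevni", "Knji%9Eevni"):
    print(raw, "-->", decode_query_value(raw))
```

Trying UTF-8 first is the safer order, since text in a single-byte encoding seldom happens to form valid multi-byte UTF-8 sequences, whereas nearly any byte string decodes "successfully" as Windows-1252.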
