Properly encoding characters in a URL when using HttpClient
I have a list of URLs that I need to verify are valid. I've written a Java program that uses Apache's HttpClient to check each link. I had to implement my own redirect strategy because the redirect URLs contain invalid characters (like {}), which the default strategy didn't handle. It works fine in the majority of cases, except for two:
Escaped Characters in the path or query params, which should not be encoded further. Example:
String url = "http://www.example.com/chapter1/%3Fref%3Dsomething%26term%3D?ref=xyz"
If I use a URI object, it chokes on the "{" character.
URI myUri = new URI(url); // ==> This will fail.
If I run:
URI myUri = new URI(UriUtils.encodeHttpUrl(url))
it encodes the %3F to %253F.
However, when I follow the link using Chrome or Fiddler, I do not see %3F getting escaped again. How do I protect from over-encoding the path or query params?
The last query param in the URL is itself a valid URL. E.g.:
String url = "www.example.com/Chapter1/?param1=xyz&param2=http://www.google.com/?abc=1"
My current encoding strategy splits up the query params and then calls URLEncoder.encode on each of them. This, however, causes the last param to be encoded as well (which is not the case when I follow the link in Fiddler or Chrome).
I've tried a number of things (using UriUtils, special-casing URLs as the last param, and other hacks), but nothing seems ideal. What's the best way to solve this?
You cannot "protect from over-encoding". You either encode, or you do not. You should always know, for any given string, whether it is encoded or not. You should only encode strings which are not yet encoded, and you should never encode strings which are already encoded.
So is this string encoded or not?
It seems to me like this is bad input: clearly this is not encoded because it contains invalid characters ('{' and '}'). Yet it also seems not to be an unencoded string, because it contains '%xx' sequences. So it's partly-encoded. There is no programmatic "solution" once a string is in this form -- you simply need to avoid getting a string into such a form in the first place.
You may be able to construct an algorithm which "fixes" this string, by carefully looking for parts that look like a "%" followed by two hex digits, and leaving them alone. But this will break on subtle cases. Consider an unencoded string "42%23", which is supposed to be a literal representation of the mathematical expression "42 mod 23". When I put this into a URI, I expect it to encode as "42%2523" so it decodes as "42%23", but the above algorithm will break and encode it as "42%23", which will then decode as "42#".
So there is no way to fix the above string. Encoding "%3F" to "%253F" is exactly what a URI encoder should be doing.
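To make the ambiguity concrete, here is a minimal Java sketch of the "leave %xx alone" fix-up described above, next to a strict encoder that escapes every '%'. The class name LenientEncoder, the method encode and the flag keepExistingEscapes are made up for illustration, and the set of characters passed through is an assumption rather than anything HttpClient or UriUtils does.

```java
import java.nio.charset.StandardCharsets;

final class LenientEncoder {
    // Assumption: delimiters and unreserved punctuation are passed through untouched.
    private static final String SAFE = "-._~:/?#[]@!$&'()*+,;=";

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean isAsciiAlnum(int cp) {
        return (cp >= '0' && cp <= '9') || (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
    }

    // keepExistingEscapes = true is the heuristic "fix"; false is a strict encoder.
    static String encode(String s, boolean keepExistingEscapes) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // A '%' followed by two hex digits is guessed to be an existing escape.
            boolean looksEscaped = keepExistingEscapes && cp == '%'
                    && i + 2 < s.length()
                    && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2));
            if (isAsciiAlnum(cp) || SAFE.indexOf(cp) >= 0 || looksEscaped) {
                out.appendCodePoint(cp);
            } else {
                // Percent-encode every UTF-8 byte of the character.
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a{b}", true));    // a%7Bb%7D
        System.out.println(encode("%3Fref", true));  // %3Fref   (left alone)
        System.out.println(encode("%3Fref", false)); // %253Fref (strict)
        System.out.println(encode("42%23", true));   // 42%23    (will decode as "42#")
        System.out.println(encode("42%23", false));  // 42%2523  (will decode as "42%23")
    }
}
```

The heuristic gives the asker's desired result for "%3Fref" and silently corrupts "42%23". The code cannot tell the two cases apart, which is the point being made above.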
Note: Having said this, browsers often allow you to get away with typing bad characters into URIs and they automatically encode them. That's not very robust so it shouldn't be used unless you are trying to be very forgiving of user input. In that case, you can do a "best effort" by first decoding the URI and then re-encoding it. In this case, if I wanted to type "42%23" I would have to manually type in "42%2523".
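A rough sketch of that "best effort" approach for a single query-parameter value, using only java.net.URLDecoder and java.net.URLEncoder, might look like the following. The class BestEffortValue and its normalize method are hypothetical names. Note the assumption that form-style encoding is acceptable: these two classes treat '+' as a space, so this is not suitable for path segments.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class BestEffortValue {
    // Decode first (so existing %xx escapes are not doubled), then encode again.
    static String normalize(String raw) {
        try {
            String decoded;
            try {
                decoded = URLDecoder.decode(raw, "UTF-8");
            } catch (IllegalArgumentException malformedEscape) {
                // Not valid percent-encoding (e.g. a trailing '%'): treat as unencoded.
                decoded = raw;
            }
            return URLEncoder.encode(decoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("{abc}"));     // %7Babc%7D
        System.out.println(normalize("%3Fref%3D")); // %3Fref%3D (not double-encoded)
        System.out.println(normalize("42%23"));     // 42%23     (read as "42#")
        System.out.println(normalize("42%2523"));   // 42%2523   (read as "42%23")
    }
}
```

As the note says, a user who really means the literal text "42%23" has to supply "42%2523" for this scheme to preserve it.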
As for question 2:
Similarly, this is exactly what you want. If a URI appears as a parameter inside another URI, it should be percent-encoded. Otherwise, how can you tell where one URI finishes and the other continues? I believe the above URI is actually valid (since ':', '/', '&' and '=' are reserved characters, not forbidden, and therefore they are allowed as long as they do not create ambiguity). But it is much safer to have a URI-inside-a-URI escaped.
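As an illustration of why the inner URI should be escaped, the sketch below appends a URL as a parameter value and reads it back. NestedUrlParam, withParam and param are hypothetical helpers, the "http://" scheme on the outer URL is an assumption (the question's example has none), and the inner URL has an extra "&def=2" added to show the ambiguity.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class NestedUrlParam {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String dec(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends name=value to an already-valid URL, encoding both parts exactly once.
    static String withParam(String base, String name, String value) {
        char sep = base.indexOf('?') < 0 ? '?' : '&';
        return base + sep + enc(name) + "=" + enc(value);
    }

    // Returns the decoded value of the first parameter called name, or null.
    static String param(String url, String name) {
        String rawQuery = URI.create(url).getRawQuery();
        if (rawQuery == null) {
            return null;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (dec(key).equals(name)) {
                return eq < 0 ? "" : dec(pair.substring(eq + 1));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String outer = "http://www.example.com/Chapter1/?param1=xyz";
        String inner = "http://www.google.com/?abc=1&def=2";

        String escaped = withParam(outer, "param2", inner);
        System.out.println(escaped);
        System.out.println(param(escaped, "param2")); // the full inner URL

        String unescaped = outer + "&param2=" + inner;
        System.out.println(param(unescaped, "param2")); // cut off at the inner '&'
    }
}
```

With the unescaped form, "def=2" is read as a parameter of the outer URL, so the receiver cannot recover the original inner URL.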