HTTPClient - HTTP GET broken by # anchor in redirect URL
This is a bit of a weird one. I'm using HTTPClient 4.1.2, and it seems that whenever it finds a URL with something like a '#' in it, it does a full GET with the # still in the URL.
For example, trying to get the URL http://stks.co/eWt will redirect to the URL http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter. This URL is live, but the problem is that HTTPClient sends a GET request with the request URI set to /2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter, which causes the server to send back a 404 Page Not Found.
Looking at the GET sent by IE, Firefox and cURL, they all strip the #... from the end of the URI. For example, the cURL GET request URI is set to /2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/ with the whole #... part removed. This is for the exact same entry URL, http://stks.co/eWt.
As a test, sending this raw URL into HTTPClient (i.e. HttpGet httpget = new HttpGet("http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter");) gives the same 404 Not Found result.
So the question is: are there any settings in HTTPClient that make it automatically remove a trailing #... from URLs? If not, how would I go about removing it manually (remember that I would need to catch all redirect URLs as well)?
Comments (2)
It sounds like their web server is broken. The URI specification says that a number sign (#) terminates the path portion of the URI. If a web server considers anything after a # part of the path, it is not following the URI specification.
I tested a few popular web servers, and they all parse these URIs correctly, ignoring the portion after the number sign.
I don't have any good suggestions for a workaround though. But at least now you know who to blame.
Note: Browsers do not send the hash, or anything after it, to the server. The hash in a URL is meant for the client (the browser) to work with, not the server.
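For the "remove it manually" part of the question, here is a minimal sketch using only java.net.URI from the JDK. The class name FragmentStripper and the method stripFragment are made up for illustration and are not part of HTTPClient. It assumes the URL is already a syntactically valid URI, and it only covers URLs you build yourself; redirect targets would still need to pass through the same helper wherever your code handles the Location header.

```java
import java.net.URI;

final class FragmentStripper {

    private FragmentStripper() {
    }

    /**
     * Returns the URI without its fragment, i.e. without the '#' and
     * everything after it. A URI that has no fragment is returned unchanged.
     * (Hypothetical helper, not an HTTPClient API.)
     */
    static URI stripFragment(URI uri) {
        if (uri.getRawFragment() == null) {
            return uri;
        }
        // In a valid URI the first '#' always starts the fragment, so
        // cutting the string there keeps scheme, host, path and query
        // exactly as they were, including any percent-escapes.
        String text = uri.toString();
        return URI.create(text.substring(0, text.indexOf('#')));
    }
}
```

With this in place, the request from the question could be created as new HttpGet(FragmentStripper.stripFragment(URI.create(url))), so the request line no longer carries the #.Tpw-xG61XjU.twitter part.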