HTTPClient - HTTP GET breaks on # anchors in redirect URLs

Posted on 2024-12-11 06:51:46

This is a bit of a weird one. I'm using HTTPClient 4.1.2, and it seems that whenever it encounters a URL containing a '#', it sends the full GET with the # and everything after it still in the request URI.

For example, trying to get the URL http://stks.co/eWt redirects to the URL http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter. This URL is live, but the problem is that HTTPClient sends a GET request with the request URI set to /2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter, which causes the server to send back a 404 Not Found.

Looking at the GET sent by IE, Firefox and cURL, they all strip the #... from the end of the URI. For example, the cURL GET request URI is /2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/ with all of the #... removed. This is for the exact same entry URL, http://stks.co/eWt.

As a test, sending this raw URL into HTTPClient (i.e. HttpGet httpget = new HttpGet("http://news.ichinastock.com/2011/10/jack-ma-alibaba-has-prepared-20-billion-to-acquire-yahoo/#.Tpw-xG61XjU.twitter");) gives the same 404 not found result.

So the question is: are there any settings in HTTPClient that make it automatically remove things like the trailing #... from URLs? Or how would I go about removing it from URLs manually (remember that I would need to catch all redirect URLs as well)?

Comments (2)

莫言歌 2024-12-18 06:51:46

It sounds like their web server is broken. The URI specification says that a number sign (#) terminates the path portion of the URI. If a web server treats anything after a # as part of the path, it is not following the URI specification.

"The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component, serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI." - RFC 3986, Section 3.3
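To make the quoted rule concrete, here is a minimal sketch using only java.net.URI from the JDK. The class name UriParts and the helper requestTarget are made up for illustration and are not part of HTTPClient. It shows what a parser that follows the rule would put on the request line for the URL from the question: the path (plus query, if any), never the fragment.

```java
import java.net.URI;

public class UriParts {

    // Hypothetical helper: returns the part of the URL that belongs on the
    // HTTP request line, i.e. the path plus an optional "?query".
    // The fragment (everything from '#') is deliberately left out.
    public static String requestTarget(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // an empty path is requested as "/"
        }
        String query = uri.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    // Returns the fragment without the leading '#', or null if there is none.
    public static String fragmentOf(String url) {
        return URI.create(url).getRawFragment();
    }
}
```

For the redirect target in the question, requestTarget gives the path ending in "yahoo/" and fragmentOf gives ".Tpw-xG61XjU.twitter", which is the split the browsers and cURL apply.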

I tested a few popular web servers, and they all parse these URIs correctly, ignoring the portion after the number sign.

I don't have any good suggestions for a workaround, though. But at least now you know who to blame.

溺ぐ爱和你が 2024-12-18 06:51:46

Note: The hash and everything after it should not be sent to the server. The hash in a URL is meant for the browser (the client) to work with, not the server.
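For the manual route the question asks about, a minimal sketch is to cut the URL at the first '#' before handing it to HttpGet. The class name FragmentStripper and the method stripFragment are hypothetical, and the sketch assumes the input is an otherwise valid URL in which '#' only appears as the fragment delimiter. It does not touch HTTPClient's redirect handling; the assumption is that you would apply the same helper to each Location value if you follow redirects yourself.

```java
public class FragmentStripper {

    // Hypothetical helper: drops the '#' and everything after it.
    // A URL without a fragment is returned unchanged.
    public static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }
}
```

With this in place the request from the question could be built as new HttpGet(FragmentStripper.stripFragment(url)), which should send the same request URI that cURL and the browsers send.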
