Apache HTTPClient returns an empty page

Posted on 2024-09-03 14:46:04 · 475 characters · 4 views · 0 comments

I am using the Apache HTTPClient for Java and I'm facing a really strange issue. Sometimes when I try to get a dynamically generated page it returns the actual content, but other times (with a different parameter) all I get is a short sequence of \t, \r and \n characters.

How can I track what is going on in the different cases in order to find where the bug is?

My usage of the library is pretty straightforward. All I do is make these few calls on an initialized HTTPClient object:

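// Path and query string of the dynamically generated page
String content = "/pageIwant.jsp?parameter=10101010";
HttpGet request = new HttpGet(content);
// client and targetHost are initialized elsewhere
HttpResponse response = client.execute(targetHost, request);
HttpEntity entity = response.getEntity();
// Read the whole response body into a String
String page = EntityUtils.toString(entity);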

Comments (2)

荒芜了季节 2024-09-10 14:46:04

The way I would approach this is to start by attempting to fetch the same page using a web browser. If you cannot get that to work, it is probably safe to conclude that the real problem is with the server. You'll need to talk to the server's support staff.

If a browser works, try to repeat the process using the wget utility. If wget gives you problems, go back to your browser, find out exactly which headers the browser is sending in the HTTP request, and try to get wget to use the same headers. Once you've got wget to work, make a note of the headers.

Finally, return to your Java code and modify it so that the HTTP request headers it sends are the same as those that work for wget.
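As a rough illustration of that last step, here is a minimal sketch that replays a set of noted headers on a GET request and makes a whitespace-only body visible in logs. To stay self-contained it uses the JDK's built-in java.net.http classes (Java 11+) rather than Apache HttpClient; with Apache HttpClient 4.x the equivalent would be calling request.setHeader(name, value) on the HttpGet before executing it. The class name HeaderReplay, the helper names, the example.com URL and the header values are all placeholders, not values taken from the question.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderReplay {
    // Build a GET request that carries every header in the given map.
    public static HttpRequest buildRequest(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        headers.forEach(builder::header);
        return builder.build();
    }

    // Escape tab, carriage return and newline so a "blank" body shows up in logs.
    public static String visible(String body) {
        return body.replace("\t", "\\t").replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Placeholder headers: replace them with the ones noted from the working client.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Wget/1.21");
        headers.put("Accept", "*/*");
        headers.put("Accept-Language", "en-US,en;q=0.8");

        HttpRequest request = buildRequest(
                "http://example.com/pageIwant.jsp?parameter=10101010", headers);

        // Print what would be sent, then show how a whitespace-only body is rendered.
        System.out.println(request.headers().map());
        System.out.println(visible("\t\r\n\r\n"));
    }
}
```

Logging the escaped body next to the request headers for a working and a failing parameter makes it easier to see which request differences matter.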

Yes, I have to authenticate using the proxy of my university and then I am able to access all the data. The proxy authentication is working flawlessly for the 'journal page' and even for other sites, so I'd rule out that the problem is related to that.

I think you may have excluded the real problem. @BalasC is not talking about proxy authentication. Rather, he is talking about authentication at the IEEE site. And just because one part of the site appears to work without authentication does not mean it all will. (However, I'd have thought that the site would respond with a "FORBIDDEN" or "AUTHORIZATION REQUIRED" error rather than delivering strange content.)

Another possibility is that the site is trying to prevent "screen scraping" of its content by automatic tools. Check the "Terms of Service" for the site to see if what you are trying to do is allowed. (You may choose to ignore the ToS and circumvent the technical measures, but then you might find yourself or your organization IP-blocked, or you might be on the receiving end of cease-and-desist letters talking about copyright violation.)

╭⌒浅淡时光〆 2024-09-10 14:46:04

I found the solution to my problem: I was missing some header information that is apparently required by only part of the dynamic page.

To solve my issue I first used Wireshark to look at the communication between the browser and the server, and then I added all the headers I was missing.

I found out that in my case I needed to specify the 'Accept-Language' header.
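To make the failure mode concrete, here is a small self-contained sketch of a server that only renders content when 'Accept-Language' is present, together with a client that fetches the page with and without it. Everything in it is an assumption for illustration: the toy server stands in for the real site (whose actual logic is unknown), the class and method names are made up, and it uses the JDK's built-in HTTP server and java.net.http client (Java 11+) instead of Apache HttpClient. In the original Apache HttpClient 4.x code the fix would be a call such as request.setHeader("Accept-Language", "en-US,en;q=0.8") on the HttpGet before client.execute.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class AcceptLanguageDemo {
    // Start a toy server on a free local port. It returns real content only
    // when the request has an Accept-Language header, otherwise just whitespace.
    public static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/page", exchange -> {
            boolean hasLanguage = exchange.getRequestHeaders().containsKey("Accept-Language");
            String text = hasLanguage ? "<html>real content</html>" : "\t\r\n\r\n";
            byte[] body = text.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Fetch the page, sending Accept-Language only if a value is given.
    public static String fetch(int port, String acceptLanguage)
            throws IOException, InterruptedException {
        HttpRequest.Builder builder =
                HttpRequest.newBuilder(URI.create("http://127.0.0.1:" + port + "/page")).GET();
        if (acceptLanguage != null) {
            builder.header("Accept-Language", acceptLanguage);
        }
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        return client.send(builder.build(), HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startServer();
        try {
            int port = server.getAddress().getPort();
            System.out.println("without header, blank body: " + fetch(port, null).isBlank());
            System.out.println("with header: " + fetch(port, "en-US,en;q=0.8"));
        } finally {
            server.stop(0);
        }
    }
}
```

Note that the server answers with status 200 in both cases, which is why checking the status code alone would not have revealed the problem here.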
