Apache HttpClient returns the page before it has loaded?

Posted 2024-09-29 04:03:08

I noticed a strange phenomenon when using the Apache HttpClient library, and I want to know why it occurs. I created some sample code to demonstrate. Consider the following code:

// Imports assumed (commons-httpclient 3.x):
//   org.apache.commons.httpclient.*,
//   org.apache.commons.httpclient.methods.GetMethod,
//   org.apache.commons.httpclient.params.HttpMethodParams,
//   org.apache.commons.httpclient.cookie.CookiePolicy,
//   java.io.BufferedReader, java.io.InputStreamReader
// FIREFOX is a user-agent String constant defined elsewhere (not shown).

// Example URL
String url = "http://www.amazon.com/gp/offer-listing/05961580/ref=dp_olp_used?ie=UTF8";
GetMethod get = new GetMethod(url);
// Retry at most once, and never retry a request that was already sent
HttpMethodRetryHandler httpHandler = new DefaultHttpMethodRetryHandler(1, false);
get.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, httpHandler);
get.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
HttpConnectionManager connectionManager = new SimpleHttpConnectionManager();
HttpClient client = new HttpClient(connectionManager);
client.getParams().setParameter("http.useragent", FIREFOX);
String line;
StringBuilder stringBuilder = new StringBuilder();
String toStreamBody = null;
String toStringBody = null;
try {
    int statusCode = client.executeMethod(get);
    if (statusCode != HttpStatus.SC_OK) {
        System.err.println("Internet Status: " + HttpStatus.getStatusText(statusCode));
        System.err.println("While getting page: " + url);
    }
    // toString: read the body as a String
    toStringBody = get.getResponseBodyAsString();
    // toStream: read the body line by line from the stream
    InputStreamReader isr = new InputStreamReader(get.getResponseBodyAsStream());
    BufferedReader rd = new BufferedReader(isr);
    while ((line = rd.readLine()) != null) {
        stringBuilder.append(line);
    }
} catch (java.io.IOException ex) {
    // Note: the exception itself is not printed, so the cause is hidden
    System.out.println("Failed to get page: " + url);
} finally {
    get.releaseConnection();
}
toStreamBody = stringBuilder.toString();

This code prints nothing:

 System.out.println(toStringBody); // ""

This code prints the web page:

 System.out.println(toStreamBody); // "Whole Page"

But it gets even stranger...
Replace:

get.getResponseBodyAsString();

With:

 get.getResponseBodyAsString(150000);

Now we get the error:
Failed to get page: http://www.amazon.com/gp/offer-listing/0596158068/ref=dp_olp_used?ie=UTF8

I was unable to find any website other than Amazon that reproduces this behavior, but I assume there are others.

I am aware that the documentation at http://hc.apache.org/httpclient-3.x/performance.html discourages the use of getResponseBodyAsString(). However, it does not say that the page will fail to load, only that you may risk an out-of-memory error. Is it possible that getResponseBodyAsString() is returning the page before it has loaded? Why does this only happen with Amazon?
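
For reference, the streaming approach that the performance guide favors over getResponseBodyAsString() can be sketched with the JDK alone. This is a minimal sketch, not HttpClient's own implementation: the class name BoundedBodyReader, the method readBody, and the character cap are hypothetical, and the charset is assumed to be known by the caller. With HttpClient 3.x the stream would come from get.getResponseBodyAsStream(); a ByteArrayInputStream stands in for it here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a response stream into a String with an upper bound.
final class BoundedBodyReader {

    /**
     * Decodes at most maxChars characters from the stream using the given charset.
     * A null stream is treated as "no body" and yields an empty string.
     */
    static String readBody(InputStream in, Charset charset, int maxChars) throws IOException {
        if (in == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            // Stop at end of stream or once the cap is reached, whichever comes first.
            while (sb.length() < maxChars
                    && (n = reader.read(buf, 0, Math.min(buf.length, maxChars - sb.length()))) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real response body stream.
        InputStream body = new ByteArrayInputStream(
                "<html>\n<body>hello</body>\n</html>\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readBody(body, StandardCharsets.UTF_8, 150000));
    }
}
```

Unlike the readLine() loop in the question, this keeps line terminators, so the result matches the bytes the server sent. Passing the charset explicitly also avoids depending on the platform default encoding.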

Comments (1)

撩人痒 2024-10-06 04:03:08

Did you test with any other URL?

The URL in code that you provided redirects with 302 to http://www.amazon.com/dp/05961580/?tag=stackoverfl08-20, which then returns 404 (not found).

HttpClient does not handle redirects: http://hc.apache.org/httpclient-3.x/redirects.html
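
One way to see the redirect chain for a URL is to request each hop with automatic redirect following switched off and record the status codes. The sketch below uses the JDK's own HttpURLConnection rather than commons-httpclient, so it shows the idea, not HttpClient's behavior. The class RedirectTrace, the helper startDemoServer, and the /old and /new paths are all hypothetical; the local demo server only imitates a 302 followed by a 404 and stands in for the real site.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: follows redirects by hand and records each hop.
final class RedirectTrace {

    /** Requests each URL without auto-following and records "status url" per hop. */
    static List<String> trace(URI start, int maxHops) throws IOException {
        List<String> hops = new ArrayList<>();
        URI current = start;
        for (int i = 0; i < maxHops; i++) {
            HttpURLConnection conn = (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we want to see the 3xx ourselves
            int status;
            String location;
            try {
                status = conn.getResponseCode();
                location = conn.getHeaderField("Location");
            } finally {
                conn.disconnect();
            }
            hops.add(status + " " + current);
            boolean redirect = status >= 300 && status < 400 && location != null;
            if (!redirect) {
                break;
            }
            // Location may be relative, so resolve it against the current URL.
            current = current.resolve(location);
        }
        return hops;
    }

    /** Demo-only local server: /old answers 302 to /new, and /new answers 404. */
    static HttpServer startDemoServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/old", ex -> {
            ex.getResponseHeaders().add("Location", "/new");
            ex.sendResponseHeaders(302, -1); // -1 means no response body
            ex.close();
        });
        server.createContext("/new", ex -> {
            ex.sendResponseHeaders(404, -1);
            ex.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startDemoServer();
        try {
            URI start = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/old");
            for (String hop : trace(start, 5)) {
                System.out.println(hop);
            }
        } finally {
            server.stop(0);
        }
    }
}
```

Running trace against the URL from the question, instead of the demo server, would show whether the final hop really ends in a 404, which is the first thing to rule out before suspecting the body-reading methods.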
