Java: HttpComponents gets a garbage response from the input stream of a specific URL

Posted 2024-12-01 11:24:41


I am currently trying to get HttpComponents to send an HttpRequest and retrieve the response.
On most URLs this works without a problem, but when I try to get the URL of a phpBB forum, namely http://www.forum.animenokami.com, the client takes more time and the response entity contains passages more than once, resulting in a broken HTML file.

For example, the meta tags are contained six times. Since many other URLs work, I can't figure out what I am doing wrong.
The page works correctly in well-known browsers, so it is not a problem on their side.

Here is the code I use to send and receive.

    URI uri1 = new URI("http://www.forum.animenokami.com");
    HttpGet get = new HttpGet(uri1);
    get.setHeader(new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"));
    HttpClient httpClient = new DefaultHttpClient();
    HttpResponse response = httpClient.execute(get);
    HttpEntity ent = response.getEntity();
    InputStream is = ent.getContent();
    BufferedInputStream bis = new BufferedInputStream(is);
    byte[] tmp = new byte[2048];
    int l;
    String ret = "";
    while ((l = bis.read(tmp)) != -1){
        ret += new String(tmp);
    }

I hope you can help me.
If you need any more information, I will try to provide it as soon as possible.


Comments (2)

风透绣罗衣 2024-12-08 11:24:41


This code is completely broken:

String ret = "";
while ((l = bis.read(tmp)) != -1){
    ret += new String(tmp);
}

Three things:

  • This is converting the whole buffer into a string on each iteration, regardless of how much data has been read. (I suspect this is what's actually going wrong in your case.)
  • It's using the default platform encoding, which is almost never a good idea.
  • It's using string concatenation in a loop, which leads to poor performance.

Fortunately you can avoid all of this very easily using EntityUtils:

String text = EntityUtils.toString(ent);

That will use the appropriate character encoding specified in the response, if any, or ISO-8859-1 otherwise. (There's another overload which allows you to specify which character encoding to use if it's not specified.)

It's worth understanding what's wrong with your original code though rather than just replacing it with the better code, so that you don't make the same mistakes in other situations.
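In that spirit, a corrected version of the manual loop could look like the sketch below. It honors the count returned by `read()`, accumulates raw bytes, and decodes them once at the end with an explicit charset (UTF-8 here is an illustrative assumption; a real client should use the charset from the response headers, which is what `EntityUtils.toString` does for you). The chunked in-memory input merely simulates a stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadLoopFix {
    // Read an InputStream fully, honoring the byte count returned by read().
    static String readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] tmp = new byte[2048];
        int l;
        while ((l = in.read(tmp)) != -1) {
            buf.write(tmp, 0, l); // only the l bytes actually read, not the whole buffer
        }
        // Decode once, with an explicit charset, instead of decoding each
        // chunk with the platform default (which can also split multi-byte
        // characters at chunk boundaries).
        return buf.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readFully(new ByteArrayInputStream(data)));
    }
}
```

Accumulating bytes and decoding once also avoids the quadratic cost of string concatenation in a loop.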

她如夕阳 2024-12-08 11:24:41


It works fine but what I don't understand is why I see the same text multiple times only on this URL.

It will be because your client is seeing more incomplete buffers when it reads the socket. That could be:

  • because there is a network bandwidth bottleneck on the route from the remote site to your client,
  • because the remote site is doing some unnecessary flushes, or
  • some other reason.

The point is that your client must pay close attention to the number of bytes read into the buffer by the read call, otherwise it will end up inserting junk. Network streams in particular are prone to not filling the buffer.
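The junk-insertion effect can be reproduced without a network at all: reading 6 bytes through a 4-byte buffer makes the second read a short one, and a loop that ignores the count re-emits stale bytes left over from the previous read. This small sketch (buffer sizes chosen for illustration) shows exactly the kind of duplication described in the question:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ShortReadDemo {
    public static void main(String[] args) throws IOException {
        // 6 bytes of input read through a 4-byte buffer: the second read()
        // fills only 2 bytes, leaving stale data from the first read behind.
        InputStream in = new ByteArrayInputStream(
                "abcdef".getBytes(StandardCharsets.US_ASCII));
        byte[] tmp = new byte[4];
        int l;
        String broken = "";
        while ((l = in.read(tmp)) != -1) {
            broken += new String(tmp, StandardCharsets.US_ASCII); // bug: ignores l
        }
        System.out.println(broken); // prints "abcdefcd" - "cd" is stale, duplicated data
    }
}
```

On a real socket every read can come up short, not just the last one, so whole passages such as the meta tags get repeated throughout the document.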
