Java: HttpComponents gets a garbage response from the input stream of a specific URL

Posted 2024-12-01 11:24:41


I am currently trying to get HttpComponents to send an HttpRequest and retrieve the response.
On most URLs this works without a problem, but when I try to get the URL of a phpBB forum, namely http://www.forum.animenokami.com, the client takes more time and the response entity contains passages more than once, resulting in a broken HTML file.

For example, the meta tags are contained six times. Since many other URLs work, I can't figure out what I am doing wrong.
The page works correctly in well-known browsers, so it is not a problem on their side.

Here is the code I use to send and receive.

    URI uri1 = new URI("http://www.forum.animenokami.com");
    HttpGet get = new HttpGet(uri1);
    get.setHeader(new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"));
    HttpClient httpClient = new DefaultHttpClient();
    HttpResponse response = httpClient.execute(get);
    HttpEntity ent = response.getEntity();
    InputStream is = ent.getContent();
    BufferedInputStream bis = new BufferedInputStream(is);
    byte[] tmp = new byte[2048];
    int l;
    String ret = "";
    while ((l = bis.read(tmp)) != -1){
        ret += new String(tmp);
    }

I hope you can help me.
If you need any more information, I will try to provide it as soon as possible.


Comments (2)

风透绣罗衣 2024-12-08 11:24:41


This code is completely broken:

String ret = "";
while ((l = bis.read(tmp)) != -1){
    ret += new String(tmp);
}

Three things:

  • This is converting the whole buffer into a string on each iteration, regardless of how much data has been read. (I suspect this is what's actually going wrong in your case.)
  • It's using the default platform encoding, which is almost never a good idea.
  • It's using string concatenation in a loop, which leads to poor performance.

Fortunately you can avoid all of this very easily using EntityUtils:

String text = EntityUtils.toString(ent);

That will use the appropriate character encoding specified in the response, if any, or ISO-8859-1 otherwise. (There's another overload which allows you to specify which character encoding to use if it's not specified.)

It's worth understanding what's wrong with your original code though rather than just replacing it with the better code, so that you don't make the same mistakes in other situations.
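In that spirit, a corrected version of the manual loop could look like the sketch below. It honors the count returned by `read()`, accumulates raw bytes, and decodes them once at the end with an explicit charset (UTF-8 here is an illustrative assumption; a real client should use the charset from the response headers, which is what `EntityUtils.toString` does for you). The chunked in-memory input merely simulates a stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadLoopFix {
    // Read an InputStream fully, honoring the byte count returned by read().
    static String readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] tmp = new byte[2048];
        int l;
        while ((l = in.read(tmp)) != -1) {
            buf.write(tmp, 0, l); // only the l bytes actually read, not the whole buffer
        }
        // Decode once, with an explicit charset, instead of decoding each
        // chunk with the platform default (which can also split multi-byte
        // characters at chunk boundaries).
        return buf.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readFully(new ByteArrayInputStream(data)));
    }
}
```

Accumulating bytes and decoding once also avoids the quadratic cost of string concatenation in a loop.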

她如夕阳 2024-12-08 11:24:41


It works fine but what I don't understand is why I see the same text multiple times only on this URL.

It will be because your client is seeing more incomplete buffers when it reads the socket. That could be:

  • because there is a network bandwidth bottleneck on the route from the remote site to your client,
  • because the remote site is doing some unnecessary flushes, or
  • some other reason.

The point is that your client must pay close attention to the number of bytes read into the buffer by the read call, otherwise it will end up inserting junk. Network streams in particular are prone to not filling the buffer.
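The junk-insertion effect can be reproduced without a network at all: reading 6 bytes through a 4-byte buffer makes the second read a short one, and a loop that ignores the count re-emits stale bytes left over from the previous read. This small sketch (buffer sizes chosen for illustration) shows exactly the kind of duplication described in the question:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ShortReadDemo {
    public static void main(String[] args) throws IOException {
        // 6 bytes of input read through a 4-byte buffer: the second read()
        // fills only 2 bytes, leaving stale data from the first read behind.
        InputStream in = new ByteArrayInputStream(
                "abcdef".getBytes(StandardCharsets.US_ASCII));
        byte[] tmp = new byte[4];
        int l;
        String broken = "";
        while ((l = in.read(tmp)) != -1) {
            broken += new String(tmp, StandardCharsets.US_ASCII); // bug: ignores l
        }
        System.out.println(broken); // prints "abcdefcd" - "cd" is stale, duplicated data
    }
}
```

On a real socket every read can come up short, not just the last one, so whole passages such as the meta tags get repeated throughout the document.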
