Java: HttpComponents gets a garbage response from the input stream of a specific URL
I am currently trying to get HttpComponents to send an HTTP request and retrieve the response.
On most URLs this works without a problem, but when I try to fetch the URL of a phpBB forum, namely http://www.forum.animenokami.com, the client takes more time and the response entity contains passages more than once, resulting in a broken HTML file.
For example, the meta tags are contained six times. Since many other URLs work, I can't figure out what I am doing wrong.
The page works correctly in well-known browsers, so it is not a problem on their side.
Here is the code I use to send and receive:
URI uri1 = new URI("http://www.forum.animenokami.com");
HttpGet get = new HttpGet(uri1);
get.setHeader(new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"));
HttpClient httpClient = new DefaultHttpClient();
HttpResponse response = httpClient.execute(get);
HttpEntity ent = response.getEntity();
InputStream is = ent.getContent();
BufferedInputStream bis = new BufferedInputStream(is);
byte[] tmp = new byte[2048];
int l;
String ret = "";
while ((l = bis.read(tmp)) != -1){
ret += new String(tmp);
}
I hope you can help me.
If you need any more information, I will try to provide it as soon as possible.
Comments (2)
This code is completely broken:
Three things:
Fortunately you can avoid all of this very easily using EntityUtils, for example EntityUtils.toString(ent). That will use the appropriate character encoding specified in the response, if any, or ISO-8859-1 otherwise. (There's another overload which allows you to specify which character encoding to use if it's not specified.)
It's worth understanding what's wrong with your original code though rather than just replacing it with the better code, so that you don't make the same mistakes in other situations.
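To make the problems concrete, here is a minimal sketch of a corrected read loop that uses only the JDK. It assumes the issues are the ones visible in the question's loop: the byte count returned by read is ignored, each chunk is decoded separately with the platform default charset, and strings are concatenated in a loop. The class name StreamText, the helper readAll, and the sample body are hypothetical names chosen for illustration; they are not part of HttpComponents.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamText {

    /**
     * Hypothetical helper: reads the stream to the end and decodes it once.
     * Only the bytes actually returned by each read() call are kept.
     */
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[2048];
        int n;
        while ((n = in.read(buf)) != -1) {
            // Copy exactly n bytes; the rest of buf may hold stale data
            // left over from an earlier, larger read.
            out.write(buf, 0, n);
        }
        // Decode once, at the end, with an explicit charset, so a multi-byte
        // character split across two reads is still decoded correctly.
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for ent.getContent(); assumes the body is UTF-8.
        byte[] body = "<title>h\u00e9llo</title>".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        System.out.println(text);
    }
}
```

In the question's code, the stream passed to such a helper would be ent.getContent(), and the charset would have to come from the response's Content-Type header, which is the part EntityUtils takes care of for you.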
It will be because your client is seeing more incomplete buffers when it reads from the socket. That could have a number of causes.
The point is that your client must pay close attention to the number of bytes read into the buffer by the read call, otherwise it will end up inserting junk. Network streams in particular are prone to not filling the buffer.
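The effect can be reproduced without a network. The sketch below wraps an in-memory stream in a hypothetical TrickleInputStream that hands out at most a few bytes per read call, roughly imitating a slow socket, and compares a loop that ignores the returned count with one that honors it. All class and method names here are illustrative assumptions, and the buffer is kept tiny so the junk is easy to see.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ShortReadDemo {

    /** Hypothetical wrapper: returns at most maxChunk bytes per read call. */
    static final class TrickleInputStream extends FilterInputStream {
        private final int maxChunk;

        TrickleInputStream(InputStream in, int maxChunk) {
            super(in);
            this.maxChunk = maxChunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, maxChunk));
        }
    }

    /** Mirrors the question's loop: converts the whole buffer every time. */
    static String readIgnoringCount(InputStream in) throws IOException {
        byte[] tmp = new byte[4];
        StringBuilder sb = new StringBuilder();
        while (in.read(tmp) != -1) {
            // Bug: bytes beyond the count just read are zeros or stale data.
            sb.append(new String(tmp, StandardCharsets.US_ASCII));
        }
        return sb.toString();
    }

    /** Converts only the bytes that the last read call delivered. */
    static String readHonoringCount(InputStream in) throws IOException {
        byte[] tmp = new byte[4];
        StringBuilder sb = new StringBuilder();
        int n;
        while ((n = in.read(tmp)) != -1) {
            // Per-chunk decoding is safe here only because the data is ASCII.
            sb.append(new String(tmp, 0, n, StandardCharsets.US_ASCII));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abcdefgh".getBytes(StandardCharsets.US_ASCII);
        String bad = readIgnoringCount(new TrickleInputStream(new ByteArrayInputStream(data), 3));
        String good = readHonoringCount(new TrickleInputStream(new ByteArrayInputStream(data), 3));
        System.out.println("ignoring count: " + bad.length() + " chars");
        System.out.println("honoring count: " + good.length() + " chars");
    }
}
```

With a fast server the buffer often fills completely on every read except the last, so the broken loop can appear to work; a slower or chunkier server produces more partial reads and therefore more junk, which would be consistent with only some URLs misbehaving.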