Crawler leaves a large number of ESTABLISHED TCP sockets to some servers
I've got a Java web crawler. I've noticed that for a small number of servers I crawl I am left with a large number of ESTABLISHED sockets:
joel@bohr:~/tmp/test$ lsof -p 6760 | grep TCP
java 6760 joel 105u IPv6 96546 0t0 TCP bohr:55602->174.143.223.193:www (ESTABLISHED)
java 6760 joel 109u IPv6 96574 0t0 TCP bohr:55623->174.143.223.193:www (ESTABLISHED)
java 6760 joel 110u IPv6 96622 0t0 TCP bohr:55644->174.143.223.193:www (ESTABLISHED)
java 6760 joel 111u IPv6 96674 0t0 TCP bohr:55665->174.143.223.193:www (ESTABLISHED)
There could be many tens of these to any one server, and I can't figure out why they are being left open.
I'm using HttpURLConnection to establish a connection and read data. HTTP 1.1 and keep-alive are on (by default). It's my understanding that the underlying TCP socket to a remote server will be re-used by Java's HttpURLConnection, so long as I close the input/error stream and all data is read from the stream. It's also my understanding that if an exception is thrown, then so long as the input/error stream is closed (if not null), the socket, although not re-used, will be closed. (java handling of http-keepalive)
My abbreviated code looks like this:
HttpURLConnection conn = null; // declared outside the try so the catch block can see it
InputStream is = null;
try {
    conn = (HttpURLConnection) uri.toURL().openConnection();
    conn.setReadTimeout(10000);
    conn.setConnectTimeout(10000);
    conn.setRequestProperty("User-Agent", userAgent);
    conn.setRequestProperty("Accept", "text/html,text/xml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    conn.setRequestProperty("Accept-Encoding", "gzip deflate");
    conn.setRequestProperty("Accept-Language", "en-gb,en;q=0.5");
    conn.connect();
    try {
        int responseCode = conn.getResponseCode();
        is = conn.getInputStream();
    } catch (IOException e) {
        is = conn.getErrorStream();
        if (is != null) {
            // consume the error stream, http://download.oracle.com/javase/6/docs/technotes/guides/net/http-keepalive.html
            StreamUtils.readStreamToBytes(is, -1, MAX_LN);
        }
        throw e;
    }
    String type = conn.getContentType();
    byte[] response = StreamUtils.readStream(is);
    // do something with content
} catch (Exception e) {
    if (conn != null) {
        conn.disconnect(); // don't try to re-use socket - just be done with it.
    }
    throw e;
} finally {
    if (is != null) {
        is.close();
    }
}
I've noticed that for a site where this is happening I get a lot of IOExceptions thrown when making GET requests, due to:
java.net.ProtocolException: Server redirected too many times (20)
I'm pretty sure I'm handling this, closing the socket properly. Could it really be this, or something else I'm doing wrong? Could it be a result of mis-using keep-alive - and if so how to fix it? I'd rather not have to turn keep-alive off to fix the problem.
EDIT: I've tested setting the following property:
conn.setRequestProperty("Connection", "close"); // supposed to disable keep-alive
Sending the Connection: close header disabled persistent TCP connections and all sockets are eventually cleaned up. So, it would seem that the problem I am seeing is indeed to do with keep-alive and sockets not being closed correctly, even after closing the input stream.
EDIT2 - could it be that one socket is created every time the request is redirected? Where this problem is noticeable, the request is being redirected 20 times before the exception above is thrown. If this were the case, is there a way of limiting the number of redirects on a URLConnection?
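On the question of limiting redirects, two approaches seem worth trying. The JDK documents an http.maxRedirects system property (default 20) that caps automatic redirects, which matches the "(20)" in the exception above. Alternatively, automatic following can be switched off with setInstanceFollowRedirects(false) and the Location header followed by hand, which lets each hop be released explicitly. The sketch below illustrates the second approach; the class name RedirectLimiter, the helper names, and the set of status codes treated as redirects are assumptions for illustration, not part of the original crawler, and it does not establish whether each redirect actually opens a new socket.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ProtocolException;
import java.net.URI;

// Hypothetical helper: follows redirects manually with a caller-chosen limit.
class RedirectLimiter {

    /** Status codes this sketch treats as redirects (an assumption; adjust as needed). */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /**
     * Opens a connection to the given URI, following at most maxRedirects hops.
     * The caller is responsible for reading, closing and disconnecting the
     * connection that is returned.
     */
    static HttpURLConnection openFollowing(URI start, int maxRedirects) throws IOException {
        URI current = start;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            HttpURLConnection conn = (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we follow Location ourselves
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);

            int status;
            try {
                status = conn.getResponseCode();
            } catch (IOException e) {
                conn.disconnect(); // release this hop before propagating
                throw e;
            }
            if (!isRedirect(status)) {
                return conn; // final response; caller reads and cleans up
            }

            String location = conn.getHeaderField("Location");
            conn.disconnect(); // done with this intermediate hop
            if (location == null) {
                throw new ProtocolException("Redirect without Location header from " + current);
            }
            current = current.resolve(location); // handles relative Location values
        }
        throw new ProtocolException(
                "Too many redirects (limit " + maxRedirects + ") for " + start);
    }
}
```

A call such as RedirectLimiter.openFollowing(uri, 5) would then replace the openConnection() call in the code above. If only a lower global cap is wanted, starting the JVM with -Dhttp.maxRedirects=5 may be simpler.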
Comments (1)
You need to move conn.disconnect() into your finally section. As it is, you only disconnect if there's an exception thrown.
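A minimal sketch of that suggestion follows, assuming a simplified fetch method; the class name FetchWithCleanup and the drain helper are hypothetical stand-ins for the question's StreamUtils. Note that the HttpURLConnection Javadoc describes disconnect() as a signal that further requests to the server are unlikely, so calling it on every request may give up connection reuse for that host.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical example class; names are not from the original crawler.
class FetchWithCleanup {

    /** Reads the stream to the end and returns everything that was read. */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Fetches the body of the URL, always disconnecting when done. */
    static byte[] fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        InputStream is = null;
        try {
            is = conn.getInputStream();
            return drain(is);
        } catch (IOException e) {
            // Consume and close the error stream, if there is one, before rethrowing.
            InputStream err = conn.getErrorStream();
            if (err != null) {
                try {
                    drain(err);
                } finally {
                    err.close();
                }
            }
            throw e;
        } finally {
            if (is != null) {
                is.close();
            }
            conn.disconnect(); // runs on both the success and the failure path
        }
    }
}
```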