Crawler leaves large numbers of ESTABLISHED TCP sockets open to certain servers

Published 2024-10-13 12:24:29


I've got a Java web crawler. I've noticed that for a small number of servers I crawl I am left with a large number of ESTABLISHED sockets:

joel@bohr:~/tmp/test$ lsof -p 6760 | grep TCP 
java    6760 joel  105u  IPv6      96546      0t0      TCP bohr:55602->174.143.223.193:www (ESTABLISHED)
java    6760 joel  109u  IPv6      96574      0t0      TCP bohr:55623->174.143.223.193:www (ESTABLISHED)
java    6760 joel  110u  IPv6      96622      0t0      TCP bohr:55644->174.143.223.193:www (ESTABLISHED)
java    6760 joel  111u  IPv6      96674      0t0      TCP bohr:55665->174.143.223.193:www (ESTABLISHED)

There can be many tens of these open to any one server, and I can't figure out why they are being left open.

I'm using HttpURLConnection to establish a connection and read data. HTTP 1.1 and keep-alive are on (by default). It's my understanding that the underlying TCP socket to a remote server will be re-used by Java's HttpURLConnection, so long as I close the input/error stream and all data is read from the stream. It's also my understanding that if an exception is thrown, then so long as the input/error stream is closed (if not null), the socket, although not re-used again, will be closed. (Java handling of HTTP keep-alive)

My abbreviated code looks like this:

  HttpURLConnection conn = null; // declared outside try so catch/finally can see it
  InputStream is = null;
  try {
    conn = (HttpURLConnection) uri.toURL().openConnection();
    conn.setReadTimeout(10000);
    conn.setConnectTimeout(10000);
    conn.setRequestProperty("User-Agent", userAgent);
    conn.setRequestProperty("Accept", "text/html,text/xml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    conn.setRequestProperty("Accept-Encoding", "gzip, deflate"); // was "gzip deflate"; values must be comma-separated
    conn.setRequestProperty("Accept-Language", "en-gb,en;q=0.5");
    conn.connect();

    try {
      int responseCode = conn.getResponseCode();
      is = conn.getInputStream();
    } catch (IOException e) {
      is = conn.getErrorStream();
      if (is != null) {
        // consume the error stream so the socket can be reused,
        // http://download.oracle.com/javase/6/docs/technotes/guides/net/http-keepalive.html
        StreamUtils.readStreamToBytes(is, -1, MAX_LN);
      }
      throw e;
    }

    String type = conn.getContentType();

    byte[] response = StreamUtils.readStream(is);
    // do something with content

  } catch (Exception e) {
    if (conn != null) {
      conn.disconnect(); // don't try to re-use the socket - just be done with it
    }
    throw e;
  } finally {
    if (is != null) {
      is.close();
    }
  }

I've noticed that for a site where this is happening I get a lot of IOExceptions thrown when making GET requests, due to:

java.net.ProtocolException: Server redirected too many times (20)

I'm pretty sure I'm handling this, closing the socket properly. Could it really be this, or something else I'm doing wrong? Could it be a result of mis-using keep-alive - and if so how to fix it? I'd rather not have to turn keep-alive off to fix the problem.

EDIT: I've tested setting the following property:

        conn.setRequestProperty("Connection", "close"); // supposed to disable keep-alive

Sending the Connection: close header disables persistent TCP connections, and all sockets are eventually cleaned up. So it would seem that the problem I am seeing is indeed to do with keep-alive and sockets not being closed correctly, even after closing the input stream.
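As an aside, the per-request header is not the only switch: the JDK also honors the documented `http.keepAlive` networking system property, which disables the keep-alive socket cache for the whole JVM. A minimal sketch (the class name is my own; the property must be set before the first HTTP connection is opened):

```java
public class DisableKeepAlive {
    public static void main(String[] args) {
        // Disable the HttpURLConnection keep-alive socket cache for the whole JVM.
        // Must be set before the first HTTP connection is opened.
        System.setProperty("http.keepAlive", "false");
        System.out.println(System.getProperty("http.keepAlive"));
    }
}
```

This is a blunter instrument than a per-request `Connection: close`, but useful for confirming whether the keep-alive cache is the culprit.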

EDIT2 - could it be that one socket is created every time the request is redirected? Where this problem is noticeable, the request is being redirected 20 times before the exception above is thrown. If this is the case, is there a way of limiting the number of redirects on a URLConnection?
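On the redirect cap: the JDK's limit of 20 can be lowered via the documented `http.maxRedirects` system property, or you can turn automatic following off with `setInstanceFollowRedirects(false)` and apply your own cap. A sketch of the manual approach, assuming you want each hop's body drained so its socket is eligible for reuse (the `RedirectLimiter` class and `MAX_REDIRECTS` value are my own illustration, not from the question):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectLimiter {
    static final int MAX_REDIRECTS = 5; // our own cap, instead of the JDK default of 20

    // 3xx codes that carry a Location header worth following
    static boolean isRedirect(int code) {
        return code == HttpURLConnection.HTTP_MOVED_PERM   // 301
            || code == HttpURLConnection.HTTP_MOVED_TEMP   // 302
            || code == HttpURLConnection.HTTP_SEE_OTHER    // 303
            || code == 307 || code == 308;
    }

    static HttpURLConnection openWithLimit(URL url) throws IOException {
        for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false); // we follow manually
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            int code = conn.getResponseCode();
            if (!isRedirect(code)) {
                return conn; // caller reads and closes the stream
            }
            String location = conn.getHeaderField("Location");
            // drain and close the body so the socket can go back to the keep-alive cache
            try (InputStream in = conn.getInputStream()) {
                while (in.read() != -1) { /* discard */ }
            }
            if (location == null) {
                throw new IOException("Redirect without Location header");
            }
            url = new URL(url, location); // resolves relative redirects too
        }
        throw new IOException("Too many redirects (> " + MAX_REDIRECTS + ")");
    }
}
```

Following redirects by hand also lets a crawler log or reject redirect loops explicitly instead of waiting for the ProtocolException.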


Comments (1)

怎樣才叫好 2024-10-20 12:24:29


You need to move conn.disconnect() into your finally section. As it is, you only disconnect when an exception is thrown.
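A minimal restructuring along those lines, with the stream close and disconnect both in finally (the `FetchOnce`/`readFully` names are my own; note that an unconditional disconnect() also gives up keep-alive reuse for that connection, which trades throughput for predictable socket cleanup):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchOnce {
    // Drain a stream completely; fully reading the body is what would let the
    // keep-alive cache reuse the socket in the non-disconnect variant.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    static byte[] fetch(URL url) throws IOException {
        // declared before try so the finally block can see it
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        InputStream is = null;
        try {
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            is = conn.getInputStream();
            return readFully(is);
        } finally {
            if (is != null) {
                try { is.close(); } catch (IOException ignored) { }
            }
            conn.disconnect(); // always tear the connection down, success or failure
        }
    }
}
```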
