Code hangs while trying to get the response code
I am trying to crawl 300,000 URLs. However, somewhere in the middle, the code hangs while trying to retrieve the response code from a URL. I am not sure what is going wrong, since a connection is being established, but the problem occurs after that. Any suggestions or pointers would be greatly appreciated. Also, is there any way to probe a website for a set time period and, if it does not respond, simply move on to the next one?
I have modified the code as per the suggestions, setting the read timeout and the request property as advised. However, even now the code is unable to obtain the response code!
Here is my modified code snippet:
URL url = null;
try
{
    Thread.sleep(8000);
}
catch (InterruptedException e1)
{
    e1.printStackTrace();
}
try
{
    // urlToBeCrawled comes from the database
    url = new URL(urlToBeCrawled);
}
catch (MalformedURLException e)
{
    e.printStackTrace();
    // The code is inside a loop, hence the continue. (Apologies for putting logic in the catch block.)
    continue;
}
HttpURLConnection huc = null;
try
{
    huc = (HttpURLConnection) url.openConnection();
}
catch (IOException e)
{
    e.printStackTrace();
    continue;
}
try
{
    // Added the request property
    huc.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    huc.setRequestMethod("HEAD");
}
catch (ProtocolException e)
{
    e.printStackTrace();
}
huc.setConnectTimeout(1000);
try
{
    huc.connect();
}
catch (IOException e)
{
    e.printStackTrace();
    continue;
}
int responseCode = 0;
try
{
    // Sets the read timeout
    huc.setReadTimeout(15000);
    // The code hangs here for some URL, which is different on each run
    responseCode = huc.getResponseCode();
}
catch (IOException e)
{
    huc.disconnect();
    e.printStackTrace();
    continue;
}
if (responseCode != 200)
{
    huc.disconnect();
    continue;
}
2 Answers
A server is holding the connection open but is not responding. It may even be detecting that you're spidering its site, and the firewall or anti-DDOS tools are intentionally trying to confuse you. Be sure you set a user-agent (some servers will get angry if you don't). Also, set a read timeout so that if it fails to read after a while, it gives up:
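For example, roughly like this (a minimal sketch: the 5-second/15-second values are placeholders I chose, and conn is just a local name, not taken from the question's code):

import java.net.HttpURLConnection;
import java.net.URL;

HttpURLConnection conn = (HttpURLConnection) new URL(urlToBeCrawled).openConnection();
// Identify the client; some servers stall or drop requests that carry no User-Agent.
conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
conn.setRequestMethod("HEAD");
conn.setConnectTimeout(5 * 1000);   // give up if the TCP connect takes too long
conn.setReadTimeout(15 * 1000);     // give up if the server accepts the connection but never answers
int responseCode = conn.getResponseCode(); // throws SocketTimeoutException instead of blocking forever
conn.disconnect();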
This really should be done using multi-threading, especially if you are attempting 300,000 URLs. I prefer the thread-pool approach for this.
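As a rough sketch of the thread-pool idea (the pool size, the urlsToBeCrawled collection, and the checkStatus helper are illustrative assumptions, not code from the question):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

ExecutorService pool = Executors.newFixedThreadPool(20); // 20 workers is an arbitrary choice
for (final String address : urlsToBeCrawled) {           // urlsToBeCrawled: the URL list from the database
    pool.submit(new Runnable() {
        public void run() {
            try {
                int code = checkStatus(address); // checkStatus: a timeout-guarded HEAD check like the one above
                // handle code != 200 here
            } catch (Exception e) {
                // a hung or failing host only ties up this one worker; the others keep crawling
            }
        }
    });
}
pool.shutdown();                          // accept no new tasks; let the queued checks finish
pool.awaitTermination(1, TimeUnit.HOURS); // upper bound on how long the whole crawl may run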
Second, you will benefit from a more robust HTTP client such as the Apache Commons HTTP client, as it lets you set the user-agent properly. Most JREs will not allow you to modify the user-agent through the HttpURLConnection class: they force it to your JDK version, e.g. Java/1.6.0_13 will be your user-agent. There are tricks to change this by adjusting a system property, but I have never seen that actually work. Again, just go with the Apache Commons HTTP library; you won't regret it.
Finally, you need a good HTTP debugger to deal with this. You can use Fiddler2 and just set up a Java proxy to point at Fiddler (scroll to the part about Java).
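A rough sketch of what that could look like with Commons HttpClient 3.x (the library version, timeout values, and the logging are my assumptions; the answer only names the library):

import java.io.IOException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.HeadMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

HttpClient client = new HttpClient();
// Pick your own User-Agent instead of the JDK-imposed "Java/1.6.0_xx".
client.getParams().setParameter(HttpMethodParams.USER_AGENT,
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
client.getHttpConnectionManager().getParams().setConnectionTimeout(5000); // connect timeout (ms)
client.getHttpConnectionManager().getParams().setSoTimeout(15000);        // read timeout (ms)

HeadMethod head = new HeadMethod(urlToBeCrawled); // the URL string from the question
try {
    int status = client.executeMethod(head);      // issues the HEAD request with the timeouts applied
    System.out.println(urlToBeCrawled + " -> " + status);
} catch (IOException e) {
    // connect/read timeout or other I/O failure: log it and move on to the next URL
    e.printStackTrace();
} finally {
    head.releaseConnection();
}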