How to add URLs to a crawler4j crawler at arbitrary times while it is running
I'm working with crawler4j (http://code.google.com/p/crawler4j/), and a simple test crawl of a single site succeeded. Now I want to add URLs at arbitrary times while the crawl is in progress.
The code below throws the following exception the second time a CrawlController is constructed. How can I add URLs during a crawl, or reuse a CrawlController? (Reusing one CrawlController without re-constructing it also failed.)
Any ideas? Or is there another good crawler in Java?
Edit: since this might be a bug, I also posted it to the crawler4j issue tracker:
http://code.google.com/p/crawler4j/issues/detail?id=87&thanks=87&ts=1318661893
private static final ConcurrentLinkedQueue<URI> urls = new ConcurrentLinkedQueue<URI>();
...
URI uri = null;
while (true) {
    uri = urls.poll();
    if (uri != null) {
        CrawlController ctrl = null;
        try {
            ctrl = new CrawlController("crawler");
            ctrl.setMaximumCrawlDepth(3);
            ctrl.setMaximumPagesToFetch(100);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        ctrl.addSeed(uri.toString());
        ctrl.start(MyCrawler.class, depth);
    } else {
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Thread.java:638)
    at edu.uci.ics.crawler4j.crawler.PageFetcher.startConnectionMonitorThread(PageFetcher.java:124)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:77)
Comments (1)
As of version 3.0, this feature is implemented in crawler4j. For example usage, see http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/
Basically, you need to start the controller in non-blocking mode:
Then you can add your seeds in a loop. Note that you don't need to start the controller several times in a loop.
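To make the pattern concrete, here is a minimal sketch that uses only the Java standard library. It is not crawler4j code: `SeedController`, `shutdownAndWait` and `processed` are hypothetical names invented for illustration, and "processing" a seed just records the URL. It only shows the shape of the approach: start the workers once without blocking, then keep adding seeds from the calling thread. In crawler4j 3.x the corresponding calls should be `startNonBlocking(...)` and `addSeed(...)` on `CrawlController`; the linked examples show the real usage.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class NonBlockingSeedDemo {

    /** Hypothetical stand-in for a crawl controller; NOT a crawler4j class. */
    static class SeedController {
        // Unique sentinel object, compared by identity, so no real URL can collide with it.
        private static final String STOP = new String("STOP");

        private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        private final List<String> processed = new CopyOnWriteArrayList<>();
        private final ExecutorService workers;
        private final int workerCount;

        SeedController(int workerCount) {
            this.workerCount = workerCount;
            this.workers = Executors.newFixedThreadPool(workerCount);
        }

        /** Starts the workers once and returns immediately. */
        void startNonBlocking() {
            for (int i = 0; i < workerCount; i++) {
                workers.submit(this::workLoop);
            }
        }

        /** May be called at any time, from any thread, while the workers run. */
        void addSeed(String url) {
            frontier.add(url);
        }

        /** Queues one stop marker per worker behind the pending seeds, then waits. */
        void shutdownAndWait() {
            for (int i = 0; i < workerCount; i++) {
                frontier.add(STOP);
            }
            workers.shutdown();
            try {
                workers.awaitTermination(30, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        List<String> processed() {
            return processed;
        }

        private void workLoop() {
            try {
                while (true) {
                    String url = frontier.take(); // blocks until a seed arrives
                    if (url == STOP) {
                        return;
                    }
                    processed.add(url); // a real crawler would fetch the page here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        SeedController controller = new SeedController(2);
        controller.startNonBlocking(); // started exactly once, outside the loop

        String[] seeds = {"http://example.com/a", "http://example.com/b", "http://example.com/c"};
        for (String seed : seeds) {
            controller.addSeed(seed); // seeds are added while the workers are running
        }

        controller.shutdownAndWait();
        System.out.println("processed " + controller.processed().size() + " seeds");
    }
}
```

The key difference from the code in the question is that the controller is constructed and started once, and only `addSeed` is called inside the loop.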