How to add URLs to a crawler4j crawl at random times while it is running



I'm working with crawler4j.
http://code.google.com/p/crawler4j/

A simple test crawl of a site succeeded, but I want to add URLs at random times while the crawl is in progress.

The code below throws the following exception the second time a CrawlController is constructed. How can I add URLs while the crawl is running, or reuse the CrawlController? (Reusing it without re-constructing it also failed.)

Any ideas? Or can you recommend another good crawler for Java?

Edit:
Since this might be a bug, I also posted it to the crawler4j issue tracker:
http://code.google.com/p/crawler4j/issues/detail?id=87&thanks=87&ts=1318661893

// URLs discovered elsewhere are queued here by other threads.
private static final ConcurrentLinkedQueue<URI> urls = new ConcurrentLinkedQueue<URI>();
...
URI uri = null;
while (true) {
    uri = urls.poll();
    if (uri != null) {
        // A new CrawlController is constructed for every URL taken from the queue;
        // the second construction throws the exception shown below.
        CrawlController ctrl = null;
        try {
            ctrl = new CrawlController("crawler");
            ctrl.setMaximumCrawlDepth(3);
            ctrl.setMaximumPagesToFetch(100);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        ctrl.addSeed(uri.toString());
        // start() blocks until this crawl finishes
        ctrl.start(MyCrawler.class, depth);
    } else {
        // nothing queued yet; wait and poll again
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Thread.java:638)
    at edu.uci.ics.crawler4j.crawler.PageFetcher.startConnectionMonitorThread(PageFetcher.java:124)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:77)


1 comment

逆流佳人身旁 · 2024-12-17 04:32:43


As of version 3.0, this feature is implemented in crawler4j. Please visit http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/ for an example of its usage.

Basically, you need to start the controller in non-blocking mode:

controller.startNonBlocking(MyCrawler.class, numberOfThreads);

Then you can add your seeds in a loop. Note that you don't need to start the controller several times in a loop.
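Below is a minimal sketch of that pattern applied to the queue-based setup from the question. It assumes the crawler4j 3.x configuration API (CrawlConfig, PageFetcher, RobotstxtServer) used in the linked example; the SeedFeeder class name, storage folder, and thread count are placeholders, so check the exact signatures against the version you actually use.

import java.net.URI;
import java.util.concurrent.ConcurrentLinkedQueue;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class SeedFeeder {

    // URLs discovered elsewhere are dropped into this queue by other threads.
    private static final ConcurrentLinkedQueue<URI> urls = new ConcurrentLinkedQueue<URI>();

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("crawler");   // placeholder storage folder
        config.setMaxDepthOfCrawling(3);
        config.setMaxPagesToFetch(100);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        // Construct the controller exactly once and start it in non-blocking mode,
        // so this thread stays free to add seeds while the crawl is running.
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.startNonBlocking(MyCrawler.class, 5);   // 5 = number of crawler threads

        while (true) {
            URI uri = urls.poll();
            if (uri != null) {
                // Feed the running crawl instead of building a new controller.
                controller.addSeed(uri.toString());
            } else {
                Thread.sleep(3000);   // nothing queued yet; poll again shortly
            }
        }
    }
}

The key difference from the code in the question is that the controller is constructed and started exactly once; only addSeed() is called inside the loop, which avoids the IllegalThreadStateException raised when a second controller tries to start the fetcher threads again.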
