Increasing the number of crawler threads

Published 2024-11-19 16:08:02 · 1,898 characters · 1 view · 0 comments

This is code taken from http://code.google.com/p/crawler4j/; the file is named MyCrawler.java.


import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

        // File extensions we never want to fetch (binary/media content).
        private final Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        /*
         * Implement this method to specify whether the given URL
         * should be visited or not.
         */
        @Override
        public boolean shouldVisit(WebURL url) {
                String href = url.getURL().toLowerCase();
                if (filters.matcher(href).matches()) {
                        return false;
                }
                return href.startsWith("http://www.xyz.us.edu/");
        }

        /*
         * Called when a page has been fetched and is ready to be
         * processed by your program.
         */
        @Override
        public void visit(Page page) {
                int docid = page.getWebURL().getDocid();
                String url = page.getWebURL().getURL();
                String text = page.getText();
                List<WebURL> links = page.getURLs();
        }
}
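As a quick sanity check, the filter pattern above can be exercised on its own. This standalone sketch (independent of crawler4j) shows which URLs the pattern rejects:

```java
import java.util.regex.Pattern;

public class FilterCheck {
    public static void main(String[] args) {
        // Same extension filter as in MyCrawler above.
        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        // Media URLs match the filter and are skipped by shouldVisit.
        System.out.println(filters.matcher("http://www.xyz.us.edu/logo.png").matches());   // true  -> skipped
        // Ordinary pages do not match and are visited.
        System.out.println(filters.matcher("http://www.xyz.us.edu/index.html").matches()); // false -> visited
    }
}
```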

And this is the code for Controller.java, from which MyCrawler gets called:

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
        public static void main(String[] args) throws Exception {
                CrawlController controller = new CrawlController("/data/crawl/root");
                controller.addSeed("http://www.xyz.us.edu/");
                controller.start(MyCrawler.class, 10);
        }
}

So I just want to make sure what this line in the Controller.java file means:

controller.start(MyCrawler.class, 10);

What is the meaning of 10 here? And if we increase this 10 to 20, what will the effect be? Any suggestions would be appreciated.


Comments (2)

像你 2024-11-26 16:08:03


This website shows the source for CrawlController.

Incrementing from 10 to 20 increases the number of crawlers (each in its own thread) - studying that code will tell you what effect this will have.

久伴你 2024-11-26 16:08:03


Given the title you put on the post, you appear to already know what this does: it sets the number of crawler threads. As for what effect it will have... that depends largely on how much of its time each thread spends waiting for I/O (mostly network, a little disk) and on how much CPU and disk throughput you have. Peak throughput will occur when one of these limits is hit:

  • no more CPU time left
  • no more network bandwidth
  • no more disk bandwidth

For CPU, don't expect to get to 100% - figure 80% or so max.
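The I/O-wait argument can be illustrated with a generic simulation (not crawler4j itself): each "fetch" blocks for 50 ms as a stand-in for network latency, and a fixed-size thread pool plays the role of the crawler threads. With one thread the fetches run serially; with ten they overlap, so wall time drops by roughly 10x.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlScaling {
    // Simulate fetching `pages` URLs, each blocking ~50 ms on "network" I/O,
    // using a pool of `threads` crawler threads. Returns elapsed wall time in ms.
    static long crawl(int pages, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < pages; i++) {
            pool.submit(() -> {
                try { Thread.sleep(50); } catch (InterruptedException e) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        long t1  = crawl(20, 1);   // one crawler thread: fetches run serially, >= 1000 ms
        long t10 = crawl(20, 10);  // ten crawler threads: fetches overlap, roughly 100 ms
        System.out.println("1 thread:   " + t1 + " ms");
        System.out.println("10 threads: " + t10 + " ms");
    }
}
```

This only models the I/O-bound regime; once threads start competing for CPU or bandwidth, adding more stops helping, which is why 20 is not automatically twice as fast as 10.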
