Increasing the number of crawler threads

Published 2024-11-19 16:08:02 · 1,898 characters · 1 view · 0 comments

This is code taken from http://code.google.com/p/crawler4j/; the file is named MyCrawler.java.


import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

        // File extensions we never want to fetch (binary/media content).
        private final Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        /*
         * Implement this method to specify whether the given URL
         * should be visited or not.
         */
        @Override
        public boolean shouldVisit(WebURL url) {
                String href = url.getURL().toLowerCase();
                if (filters.matcher(href).matches()) {
                        return false;
                }
                return href.startsWith("http://www.xyz.us.edu/");
        }

        /*
         * Called when a page has been fetched and is ready to be
         * processed by your program.
         */
        @Override
        public void visit(Page page) {
                int docid = page.getWebURL().getDocid();
                String url = page.getWebURL().getURL();
                String text = page.getText();
                List<WebURL> links = page.getURLs();
        }
}
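As a quick sanity check, the filter pattern above can be exercised on its own. This standalone sketch (independent of crawler4j) shows which URLs the pattern rejects:

```java
import java.util.regex.Pattern;

public class FilterCheck {
    public static void main(String[] args) {
        // Same extension filter as in MyCrawler above.
        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        // Media URLs match the filter and are skipped by shouldVisit.
        System.out.println(filters.matcher("http://www.xyz.us.edu/logo.png").matches());   // true  -> skipped
        // Ordinary pages do not match and are visited.
        System.out.println(filters.matcher("http://www.xyz.us.edu/index.html").matches()); // false -> visited
    }
}
```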

And this is the code for Controller.java, from which MyCrawler gets called:

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
        public static void main(String[] args) throws Exception {
                CrawlController controller = new CrawlController("/data/crawl/root");
                controller.addSeed("http://www.xyz.us.edu/");
                controller.start(MyCrawler.class, 10);
        }
}

So I just want to make sure what this line in the Controller.java file means:

controller.start(MyCrawler.class, 10);

What is the meaning of 10 here? And if we increase this 10 to 20, what will the effect be? Any suggestions would be appreciated.


Comments (2)

像你 2024-11-26 16:08:03


This website shows the source for CrawlController.

Incrementing from 10 to 20 increases the number of crawlers (each in its own thread) - studying that code will tell you what effect this will have.

久伴你 2024-11-26 16:08:03


Given the title you put on the post, you appear to already know what this does: it sets the number of crawler threads. As for what effect it will have... that depends largely on how much of its time each thread spends waiting for I/O (mostly network, a little disk) and on how much CPU and disk throughput you have. Peak throughput will occur when one of these limits is hit:

  • no more CPU time left
  • no more network bandwidth
  • no more disk bandwidth

For CPU, don't expect to get to 100% - figure 80% or so max.
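The I/O-wait argument can be illustrated with a generic simulation (not crawler4j itself): each "fetch" blocks for 50 ms as a stand-in for network latency, and a fixed-size thread pool plays the role of the crawler threads. With one thread the fetches run serially; with ten they overlap, so wall time drops by roughly 10x.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlScaling {
    // Simulate fetching `pages` URLs, each blocking ~50 ms on "network" I/O,
    // using a pool of `threads` crawler threads. Returns elapsed wall time in ms.
    static long crawl(int pages, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < pages; i++) {
            pool.submit(() -> {
                try { Thread.sleep(50); } catch (InterruptedException e) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        long t1  = crawl(20, 1);   // one crawler thread: fetches run serially, >= 1000 ms
        long t10 = crawl(20, 10);  // ten crawler threads: fetches overlap, roughly 100 ms
        System.out.println("1 thread:   " + t1 + " ms");
        System.out.println("10 threads: " + t10 + " ms");
    }
}
```

This only models the I/O-bound regime; once threads start competing for CPU or bandwidth, adding more stops helping, which is why 20 is not automatically twice as fast as 10.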
