How can I make crawler4j download all the links on a page faster?
What I do is:
- crawl the page
- fetch all the links on the page and put them in a list
- start a new crawler, which visits each link in the list
- download them
There must be a quicker way, where I can download the links directly when I visit the page? Thx!
Comments (2)
crawler4j automatically does this process for you. You first add one or more seed pages. These are the pages that are fetched and processed first. crawler4j then extracts all the links in these pages and passes them to your shouldVisit function. If you really want to crawl all of them, this function should simply return true for every URL. If you only want to crawl pages within a specific domain, you can check the URL and return true or false based on that.
The URLs for which your shouldVisit returns true are then fetched by the crawler threads, and the same process is performed on them.
The example code here is a good starting point.
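As a rough illustration of the above, here is a minimal sketch of such a crawler class, assuming a crawler4j 3.x-style API (newer versions pass an extra referringPage argument to shouldVisit); the class name MyCrawler, the example.com domain filter, and the downloads/ output folder are just placeholders.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Return true for every URL crawler4j should fetch; restricting the crawl
    // to one domain keeps it from wandering across the whole web.
    @Override
    public boolean shouldVisit(WebURL url) {
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }

    // Called after a page has been fetched and parsed; the bytes are already
    // downloaded at this point, so saving them is all that is left to do.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        byte[] content = page.getContentData();

        File outDir = new File("downloads");
        outDir.mkdirs(); // make sure the output folder exists
        File outFile = new File(outDir, Math.abs(url.hashCode()) + ".html");

        try (FileOutputStream out = new FileOutputStream(outFile)) {
            out.write(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

The point for the original question is that by the time visit is called, the page content has already been downloaded by crawler4j's worker threads, so there is no need for a separate second crawl just to download the links.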
The general approach would be to separate the crawling and downloading tasks into separate worker threads, with a maximum number of threads depending on your memory requirements (i.e. the maximum RAM you want to use for storing all this info).
However, crawler4j already gives you this functionality. By splitting downloading and crawling into separate threads, you try to maximize the utilization of your connection, pulling down as much data as your connection can handle and as the servers providing the information can send you. The natural limitation is that, even if you spawn 1,000 threads, if the servers are only giving you content at 0.3 KB per second, that's still only 300 KB per second that you'll be downloading. But you just don't have any control over that aspect of it, I'm afraid.
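For reference, this is roughly how the thread count is configured when starting the controller; it is only a sketch, assuming the usual crawler4j setup, where the storage folder, seed URL, politeness delay, and thread count are placeholder values to tune for your own memory and bandwidth, and MyCrawler is a WebCrawler subclass like the one sketched under the previous answer.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerLauncher {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-data"); // intermediate crawl data
        config.setPolitenessDelay(200);                      // ms between requests to the same host

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed page(s): fetched first; their outgoing links are passed to shouldVisit.
        controller.addSeed("http://www.example.com/");

        // Number of concurrent crawler threads. More threads keep the connection
        // busier, but total throughput is still capped by your bandwidth and by
        // how fast the target servers respond.
        int numberOfCrawlers = 10;
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
```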
The other way to increase the speed is to run the crawler on a system with a fatter pipe to the internet, since your maximum download speed is, I'm guessing, the limiting factor to how fast you can get data currently. For example, if you were running the crawling on an AWS instance (or any of the cloud application platforms), you would benefit from their extremely high speed connections to backbones, and shorten the amount of time it takes to crawl a collection of websites by effectively expanding your bandwidth far beyond what you're going to get at a home or office connection (unless you work at an ISP, that is).
It's theoretically possible that, in a situation where your pipe is extremely large, the limitation starts to become the maximum write speed of your disk, for any data that you're saving to local (or network) disk storage.