Writing a PHP web crawler with cron
I have written myself a web crawler using simplehtmldom and have got the crawl process working quite nicely. It crawls the start page, adds all links into a database table, sets a session pointer, and meta refreshes the page to carry on to the next page. That keeps going until it runs out of links.
That works fine, but obviously the crawl time for larger websites is pretty tedious. I want to be able to speed things up a bit, and possibly make it a cron job.
Any ideas on making it as quick and efficient as possible, other than raising the memory limit / execution time?
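For reference, a minimal sketch of the crawl step described above, assuming a `links` table with a unique `url` column and a `crawled` flag; the table layout, connection details, and start URL are illustrative, not the actual script:

```php
<?php
// One crawl step, roughly as described in the question.
// Assumes simplehtmldom and a MySQL `links` table (url UNIQUE, crawled TINYINT).
include 'simple_html_dom.php';

session_start();
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

// Fetch the page the session pointer currently points to.
$current = $_SESSION['pointer'] ?? 'http://example.com/';
$html = file_get_html($current);

if ($html) {
    // Store every link found on the page.
    $insert = $pdo->prepare('INSERT IGNORE INTO links (url, crawled) VALUES (?, 0)');
    foreach ($html->find('a') as $a) {
        $insert->execute([$a->href]);
    }
    $html->clear();   // free simplehtmldom's internal references
}

// Mark the current URL as done and advance the pointer to the next uncrawled link.
$pdo->prepare('UPDATE links SET crawled = 1 WHERE url = ?')->execute([$current]);
$next = $pdo->query('SELECT url FROM links WHERE crawled = 0 LIMIT 1')->fetchColumn();

if ($next) {
    $_SESSION['pointer'] = $next;
    // Meta refresh to carry on to the next page, as described above.
    echo '<meta http-equiv="refresh" content="0">';
}
```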
2 Answers
It looks like you're running your script in a web browser. Consider running it from the command line instead. You can then execute multiple scripts that crawl different pages at the same time, which should speed things up.
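A rough sketch of that suggestion: the same crawl loop as a CLI script, which cron can start and of which several copies can run in parallel. The table layout is the same assumed one as in the sketch above.

```php
<?php
// crawl_worker.php - a CLI version of the crawl loop, so it no longer relies on
// the browser's meta refresh. Run it (or several copies) from cron, e.g.:
//   * * * * * php /path/to/crawl_worker.php >> /var/log/crawler.log 2>&1
// (guard with a lock file if overlapping runs are a concern)
include 'simple_html_dom.php';

$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

// Keep pulling uncrawled links until the queue is empty, instead of one page per request.
while ($url = $pdo->query('SELECT url FROM links WHERE crawled = 0 LIMIT 1')->fetchColumn()) {
    $pdo->prepare('UPDATE links SET crawled = 1 WHERE url = ?')->execute([$url]);

    $html = file_get_html($url);
    if (!$html) {
        continue;
    }

    $insert = $pdo->prepare('INSERT IGNORE INTO links (url, crawled) VALUES (?, 0)');
    foreach ($html->find('a') as $a) {
        $insert->execute([$a->href]);
    }

    $html->clear();   // release simplehtmldom's memory before the next iteration
    unset($html);
}
```

Note that if several copies run at once, the SELECT/UPDATE pair above is not atomic, so two workers could claim the same URL; the split-work sketch at the end of the second answer shows one way to avoid that.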
Memory should not be a problem for a crawler.
Once you are done with one page and have written all relevant data to the database, you should get rid of all variables you created for that job.
The memory usage after 100 pages must be the same as after 1 page. If this is not the case, find out why.
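A minimal sketch of that per-page cleanup, assuming simplehtmldom's file_get_html(); simple_html_dom objects hold internal object references, so calling clear() before unset() is the usual way to actually release them:

```php
<?php
// Per-page cleanup so memory after 100 pages stays the same as after 1 page.
include 'simple_html_dom.php';

function crawlOne(string $url, PDO $pdo): void
{
    $html = file_get_html($url);
    if (!$html) {
        return;
    }

    // ... parse the page and write all relevant data to the database ...

    // Get rid of everything created for this page before moving on.
    $html->clear();        // break simplehtmldom's internal references
    unset($html);          // drop the variable itself
    gc_collect_cycles();   // optional: collect any remaining reference cycles
}
```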
You can split the work between different processes: parsing a page usually does not take as long as loading it, so you can write all the links that you find to a database and have multiple other processes that just download the documents to a temp directory.
If you do this, you must ensure that
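A sketch of one such downloader process, assuming a `links` table with `status` and `worker` columns and a temp path (all illustrative); each link is claimed with a single atomic UPDATE so that no two downloader processes fetch the same document:

```php
<?php
// downloader.php - one of several processes that only download documents into a
// temp directory; parsing happens in other processes.
$pdo    = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$worker = getmypid();                      // tag claimed rows with this worker's PID
$tmpDir = sys_get_temp_dir() . '/crawler';
if (!is_dir($tmpDir)) {
    mkdir($tmpDir, 0777, true);
}

// The single UPDATE is atomic, so only one process can claim a given pending link.
$claim = $pdo->prepare(
    "UPDATE links SET status = 'downloading', worker = :pid
     WHERE status = 'pending' LIMIT 1"
);
$fetchClaimed = $pdo->prepare(
    "SELECT id, url FROM links WHERE status = 'downloading' AND worker = :pid LIMIT 1"
);
$finish = $pdo->prepare("UPDATE links SET status = :status WHERE id = :id");

while (true) {
    $claim->execute([':pid' => $worker]);
    if ($claim->rowCount() === 0) {
        break;                             // nothing left to download
    }

    $fetchClaimed->execute([':pid' => $worker]);
    $row = $fetchClaimed->fetch(PDO::FETCH_ASSOC);

    // Download the document and drop it in the temp directory for the parsers.
    $body = @file_get_contents($row['url']);
    if ($body !== false) {
        file_put_contents($tmpDir . '/' . md5($row['url']) . '.html', $body);
        $finish->execute([':status' => 'downloaded', ':id' => $row['id']]);
    } else {
        $finish->execute([':status' => 'failed', ':id' => $row['id']]);
    }
}
```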