Writing a PHP web crawler with cron

Posted on 2024-10-11 14:49:19

I have written myself a web crawler using simplehtmldom, and have got the crawl process working quite nicely. It crawls the start page, adds all links into a database table, sets a session pointer, and meta refreshes the page to carry on to the next page. That keeps going until it runs out of links.

That works fine, but obviously the crawl time for larger websites is pretty tedious. I would like to speed things up a bit, and possibly make it a cron job.

Any ideas on making it as quick and efficient as possible, other than raising the memory limit / execution time?
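For reference, the crawl step described above might look roughly like this with simple_html_dom and PDO. The table layout (links: id, url, crawled), the DSN/credentials, and the UNIQUE index on url are all assumptions, not the asker's actual code:

```php
<?php
// Minimal sketch of one crawl step: pick the next queued link, parse it,
// queue the links it contains, then meta refresh to carry on.
include 'simple_html_dom.php';
session_start();

$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass'); // assumed credentials

// The "session pointer" remembers which queued link to crawl next.
$pointer = isset($_SESSION['pointer']) ? $_SESSION['pointer'] : 0;

$stmt = $db->prepare('SELECT id, url FROM links WHERE id > ? AND crawled = 0 ORDER BY id LIMIT 1');
$stmt->execute(array($pointer));
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row) {
    if ($html = file_get_html($row['url'])) {
        // Queue every link found on this page (assumes a UNIQUE index on url).
        $insert = $db->prepare('INSERT IGNORE INTO links (url, crawled) VALUES (?, 0)');
        foreach ($html->find('a') as $a) {
            $insert->execute(array($a->href));
        }
        $html->clear();   // free simple_html_dom's internal references
    }
    $db->prepare('UPDATE links SET crawled = 1 WHERE id = ?')->execute(array($row['id']));
    $_SESSION['pointer'] = $row['id'];

    echo '<meta http-equiv="refresh" content="0">';   // carry on to the next page
} else {
    echo 'Done - no links left.';
}
```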

Comments (2)

堇年纸鸢 2024-10-18 14:49:19

It looks like you're running your script in a web browser. Consider running it from the command line instead. You can then execute multiple scripts to crawl different pages at the same time, which should speed things up.
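For example, a CLI version of the loop could drop the session pointer and the meta refresh entirely and just keep running until the queue is empty. The schema and credentials below are the same assumptions as in the sketch above, and the crontab line is only an illustration:

```php
<?php
// crawl-cli.php - rough CLI version of the crawl loop (same assumed schema as above).
// Run it by hand (php crawl-cli.php) or schedule it from cron, e.g.:
//   */10 * * * * /usr/bin/php /path/to/crawl-cli.php >> /var/log/crawl.log 2>&1
include 'simple_html_dom.php';

set_time_limit(0);   // no web-server execution-time limit to worry about on the CLI
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass'); // assumed credentials

while ($row = $db->query('SELECT id, url FROM links WHERE crawled = 0 ORDER BY id LIMIT 1')
               ->fetch(PDO::FETCH_ASSOC)) {
    if ($html = file_get_html($row['url'])) {
        $insert = $db->prepare('INSERT IGNORE INTO links (url, crawled) VALUES (?, 0)');
        foreach ($html->find('a') as $a) {
            $insert->execute(array($a->href));
        }
        $html->clear();   // keep memory flat between pages
    }
    $db->prepare('UPDATE links SET crawled = 1 WHERE id = ?')->execute(array($row['id']));
}
```

Note that if you launch several copies of this at once they will race for the same rows; the next answer covers how to keep multiple workers from grabbing the same link.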

够运 2024-10-18 14:49:19

Memory should not be a problem for a crawler.

Once you are done with one page and have written all relevant data to the database, you should get rid of all the variables you created for that job.

Memory usage after 100 pages should be the same as after 1 page. If that is not the case, find out why.
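With simple_html_dom in particular, calling clear() on the parsed document (and unsetting your own references) is the usual way to keep per-page memory flat. A small sketch, with illustrative variable names:

```php
<?php
// Per-page cleanup: clear() breaks simple_html_dom's internal circular
// references, unset() drops our own handle to the document.
$html = file_get_html($url);
// ... pull out links / data and write them to the database ...
$html->clear();
unset($html);

// Handy while debugging: this figure should level off after the first few pages.
echo memory_get_usage(true), "\n";
```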

You can split the work between different processes: parsing a page usually does not take as long as loading it, so you can write all the links you find to a database and have multiple other processes that just download the documents to a temp directory.
If you do this you must ensure that (a rough sketch of such a worker follows the list):

  1. no link is downloaded by two workers.
  2. your processes wait for new links if there are none.
  3. temp files are removed after each scan.
  4. the download process stops when you run out of links. You can achieve this by setting a "kill flag", which can be a file with a special name or an entry in the database.
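Here is a rough sketch of one such downloader worker, with comments numbered to match the four points above. The links(id, url, status) layout, the flags table used for the kill flag, and the temp directory are all hypothetical names chosen for illustration:

```php
<?php
// download-worker.php - one downloader process; run several copies in parallel.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass'); // assumed credentials
$tmpDir = '/tmp/crawler';
if (!is_dir($tmpDir)) {
    mkdir($tmpDir, 0777, true);
}

while (true) {
    // 4. Stop when the kill flag has been set.
    $flag = $db->query("SELECT value FROM flags WHERE name = 'kill'")->fetchColumn();
    if ($flag) {
        break;
    }

    // 1. Claim a link inside a transaction so no two workers download the same URL.
    $db->beginTransaction();
    $row = $db->query("SELECT id, url FROM links WHERE status = 'new' LIMIT 1 FOR UPDATE")
              ->fetch(PDO::FETCH_ASSOC);
    if ($row) {
        $db->prepare("UPDATE links SET status = 'downloading' WHERE id = ?")
           ->execute(array($row['id']));
    }
    $db->commit();

    if (!$row) {
        sleep(5);      // 2. No new links yet: wait and poll again.
        continue;
    }

    // Download the document into the temp directory for the parser to pick up.
    $body = @file_get_contents($row['url']);
    if ($body !== false) {
        file_put_contents($tmpDir . '/' . $row['id'] . '.html', $body);
    }
    $db->prepare("UPDATE links SET status = 'done' WHERE id = ?")->execute(array($row['id']));
    // 3. The parser deletes the temp file once it has scanned it.
}
```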