Help writing an algorithm to index/parse a limited chunk of data on cron runs
Here's the situation. I am scraping a website to get the data from its articles, using a robots page supplied by that website (a list of URLs pointing to every article posted on the site). So far, I do a database merge to 'upsert' the URLs into my table. I know that each scraping run will take a good while because there are over 1400 articles to parse. I need to write an algorithm that will only do a small chunk of the job on each cron run so it doesn't overload my server, etc.

Edit: I think I should mention that I'm using Drupal 7. Also, this has to be an ongoing script that runs over time; I'm not so worried about the time the initial fill of the database takes. The robots page is dynamic, and URLs get added there periodically as articles are added. I'm using hook_cron() for this at the moment, but I'm open to better methods if there's something better suited to the job.
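For reference, a minimal sketch of the kind of hook_cron() upsert described above might look like the following. The module name, the {mymodule_articles} table and its columns, and the robots-page URL are all assumptions made for illustration, not details from the original setup.

<?php
/**
 * Implements hook_cron().
 *
 * Minimal sketch: fetch the robots page and upsert each article URL.
 * "mymodule", the {mymodule_articles} table, its columns, and the
 * robots-page URL are hypothetical.
 */
function mymodule_cron() {
  $response = drupal_http_request('http://example.com/robots-articles.txt');
  if ($response->code != 200) {
    return;
  }
  // One article URL per line on the robots page (assumed format).
  $urls = array_filter(array_map('trim', explode("\n", $response->data)));
  foreach ($urls as $url) {
    // Upsert: insert the URL if it is new, otherwise leave the row untouched.
    db_merge('mymodule_articles')
      ->key(array('url' => $url))
      ->insertFields(array(
        'url' => $url,
        'created' => REQUEST_TIME,
        'scraped' => 0,
      ))
      ->execute();
  }
}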
Comments (3)
You can use the Drupal queue operations API to enqueue each page to scrape as a queue item. You can, but are not required to, declare your queue as cron-executed. Drupal will then take care of executing as many queue items as it can at each cron run without exceeding the queue's declared maximum execution time.

See aggregator_cron() for an example of item enqueuing, and aggregator_cron_queue_info() for the declaration that lets Drupal process these queued items during its cron. If queue processing during the normal Drupal cron is an issue, you can process your queue independently with the help of modules like Waiting Queue or Beanstalkd integration.
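To make this concrete, here is a hedged sketch of what the enqueue/worker pair might look like, loosely modeled on aggregator_cron() and aggregator_cron_queue_info(). The module name, queue name, table, columns, and the mymodule_scrape_url() worker are all hypothetical names, not taken from the answer above.

<?php
/**
 * Implements hook_cron().
 *
 * Loosely modeled on aggregator_cron(): queue each URL that still needs
 * scraping instead of scraping it inline. Table and column names are
 * hypothetical.
 */
function mymodule_cron() {
  $queue = DrupalQueue::get('mymodule_scrape');
  $result = db_query('SELECT url FROM {mymodule_articles} WHERE scraped = 0 AND queued = 0');
  foreach ($result as $row) {
    if ($queue->createItem($row->url)) {
      // Stamp the row so it is not queued again on the next cron run,
      // much like aggregator_cron() stamps its "queued" column.
      db_update('mymodule_articles')
        ->fields(array('queued' => REQUEST_TIME))
        ->condition('url', $row->url)
        ->execute();
    }
  }
}

/**
 * Implements hook_cron_queue_info().
 *
 * Loosely modeled on aggregator_cron_queue_info(): Drupal processes items
 * from this queue on each cron run for at most 'time' seconds.
 */
function mymodule_cron_queue_info() {
  $queues['mymodule_scrape'] = array(
    'worker callback' => 'mymodule_scrape_url',
    'time' => 30,
  );
  return $queues;
}

/**
 * Queue worker callback: scrape one article URL (hypothetical helper).
 */
function mymodule_scrape_url($url) {
  $response = drupal_http_request($url);
  if ($response->code == 200) {
    // Parse $response->data and save the article here, then mark it done.
    db_update('mymodule_articles')
      ->fields(array('scraped' => 1, 'queued' => 0))
      ->condition('url', $url)
      ->execute();
  }
}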
Most likely the HTTP overhead of fetching each article will vastly outweigh the overhead of the database operations. Just don't fetch too many articles in parallel and you should be fine. Most webmasters frown on scrapers, especially ones doing 10, 20, or 500+ parallel fetches.
So, you already have the URLs in your database. Add a status column to that table: scraped or not. The cron can kick off every so often, grab the next URL from the table that has not been scraped yet, and mark it as scraped.
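A hedged sketch of that status-column approach follows; the table name, column names, and the batch size of 10 URLs per run are assumptions, not part of the answer above.

<?php
/**
 * Implements hook_cron().
 *
 * Minimal sketch of the status-column approach: take a small batch of
 * unscraped URLs each cron run, scrape them, and flag them as done.
 * Table name, column names, and the batch size are assumptions.
 */
function mymodule_cron() {
  // Only a handful of URLs per run, so a single run stays short.
  $urls = db_select('mymodule_articles', 'a')
    ->fields('a', array('url'))
    ->condition('scraped', 0)
    ->range(0, 10)
    ->execute()
    ->fetchCol();

  foreach ($urls as $url) {
    $response = drupal_http_request($url);
    if ($response->code == 200) {
      // Parse $response->data and store the article here (omitted).
      db_update('mymodule_articles')
        ->fields(array('scraped' => 1))
        ->condition('url', $url)
        ->execute();
    }
  }
}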