How long can a PHP cron job run / am I doing this right?
I have created a PHP/MySQL scraper which is running fine, but I have no idea how to most efficiently run it as a cron job.
There are 300 sites, each with between 20 and 200 pages being scraped. It takes between 4 and 7 hours to scrape all the sites (depending on network latency and other factors). The scraper needs to do a complete run once daily.
Should I run this as 1 cron job which runs for the entire 4 - 7 hours, or run it every hour 7 times, or run it every 10 minutes until complete?
The script is set up to run from the cron like this:
$starttime = time(); // recorded when cron launches the script
while ($starttime + 600 > time()) {
    do_scrape();
}
This runs the do_scrape() function, which scrapes 10 URLs at a time, until (in this case) 600 seconds have passed. Each do_scrape() call can take between 5 and 60 seconds to run.
I am asking here as I can't find any information on the web about how to run this, and am kind of wary about getting this running daily, as PHP isn't really designed to be run as a single script for 7 hours.
I wrote it in vanilla PHP/MySQL, and it is running on a cut-down Debian VPS with only lighttpd/mysql/php5 installed. I have run it with a timeout of 6000 seconds (100 minutes) without any issue (the server didn't fall over).
Any advice on how to go about this task is appreciated. What should I be watching out for, etc.? Or am I going about executing this all wrong?
Thanks!
2 Answers
There's nothing wrong with running a well-written PHP script for long periods. I have some scripts that have literally been running continuously for months. Just watch your memory usage, and you should be fine.
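A minimal sketch of what watching memory usage could look like inside the loop from the question; the 256 MB soft limit and the error_log() calls are placeholder choices, not recommendations:

$softLimit = 256 * 1024 * 1024; // hypothetical soft limit, in bytes
$starttime = time();
while ($starttime + 600 > time()) {
    do_scrape(); // the batch-of-10-URLs function from the question
    $used = memory_get_usage(true);
    error_log(sprintf('scraper memory: %.1f MB', $used / 1048576));
    if ($used > $softLimit) {
        error_log('scraper: soft memory limit reached, exiting so the next cron run can carry on');
        break;
    }
}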
That said, your architecture is pretty basic, and is unlikely to scale very well.
You might consider moving from a big monolithic script to a divide-and-conquer strategy. For instance, it sounds like your script is making synchronous requests for every URL it scrapes. If that's true, then most of that 7-hour run time is spent idly waiting for a response from some remote server.
In an ideal world, you wouldn't write this kind of thing in PHP. Some language that handles threads and can easily do asynchronous HTTP requests with callbacks would be much better suited.
That said, if I were doing this in PHP, I'd be aiming at having a script that kicks off N children that grab data from URLs and stick the response data in some kind of work queue, and then another script that runs pretty much all the time, processing any work it finds in the queue.
Then you just cron your fetcher-script-manager to run once an hour; it manages some worker processes that fetch the data (in parallel, so latency doesn't kill you) and stick the work on the queue. Then the queue-cruncher sees the work on the queue and crunches it.
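For illustration, a rough sketch of that fetcher-script-manager, assuming the pcntl extension is available to the CLI binary; get_urls_to_fetch() and enqueue_result() are hypothetical helpers standing in for your own URL list and queue:

$workers = 5;                    // placeholder worker count
$urls = get_urls_to_fetch();     // hypothetical: the URLs still to be scraped today
if (count($urls) === 0) {
    exit(0);
}
$slices = array_chunk($urls, max(1, (int) ceil(count($urls) / $workers)));

foreach ($slices as $slice) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("could not fork\n");
    }
    if ($pid === 0) {            // child: fetch its slice of URLs, then exit
        foreach ($slice as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            $body = curl_exec($ch);
            curl_close($ch);
            if ($body !== false) {
                enqueue_result($url, $body); // hypothetical: push the raw page onto the work queue
            }
        }
        exit(0);
    }
}

// parent: reap every worker so hourly runs don't pile up on top of each other
while (pcntl_wait($status) > 0) {
}

Each child could also use curl_multi_exec() to overlap its own requests; the point is only that fetching happens in parallel and is decoupled from processing.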
Depending on how you implement the queue, this could scale pretty well. You could have multiple boxes fetching remote data, and sticking it on some central queue box (with a queue implemented in mysql, or memcache, or whatever). You could even conceivably have multiple boxes taking work from the queue and doing the work.
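If the queue did live in MySQL, the queue-cruncher's claim-and-process loop might look roughly like this; the work_queue table (id, url, body, status, claimed_by), the process_page() helper, and the credentials are all hypothetical names:

$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass'); // placeholder credentials
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$worker = gethostname() . ':' . getmypid();

while (true) {
    // atomically claim one unprocessed row, so several crunchers can run at once
    $claim = $pdo->prepare("UPDATE work_queue SET status = 'claimed', claimed_by = ? WHERE status = 'new' ORDER BY id LIMIT 1");
    $claim->execute([$worker]);
    if ($claim->rowCount() === 0) {
        sleep(10);               // queue is empty; wait for the fetchers
        continue;
    }

    $stmt = $pdo->prepare("SELECT id, url, body FROM work_queue WHERE status = 'claimed' AND claimed_by = ? ORDER BY id LIMIT 1");
    $stmt->execute([$worker]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    process_page($row['url'], $row['body']); // hypothetical: parse the page and store the results

    // a real version would also sweep up rows left 'claimed' by a cruncher that died mid-job
    $pdo->prepare("UPDATE work_queue SET status = 'done' WHERE id = ?")->execute([$row['id']]);
}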
Of course, the devil is in the details, but this design is generally more scalable and usually more robust than a single-threaded fetch-process-repeat script.
You shouldn't have a problem running it once a day to completion. That's the way I would do it. Timeouts are a big issue if PHP is being served through a web server, but since you are interpreting directly through the php executable this is OK. I would advise you to use Python or something else that is more task-friendly, though.
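For what it's worth, the PHP CLI binary defaults max_execution_time to 0 (no limit), which is why the web-server timeout problem doesn't apply here; stating it explicitly at the top of the script costs nothing (the memory value is just a placeholder):

set_time_limit(0);               // CLI already defaults to no limit; this just makes it explicit
ini_set('memory_limit', '256M'); // placeholder value; size it to what the scrape actually needs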