Automatically scraping data with cron
I am currently working on an application that harvests information about greyhounds from a racing website and performs a number of calculations on the given data. The current application manages to display the data correctly and perform the correct calculations by issuing individual YQL requests against the racing website based on the user's input.
However, I've found that due to the large number of HTTP calls and the lack of data caching, the application tends to be a bit slow. To speed it up and open up the ability to analyse the data further, I would like to build some sort of system that scrapes and stores all the data relevant to a given day on the night before, via a crontab. However, I'm unsure how to go about it.
At the moment, the application goes through the following rough process (one such YQL call is sketched after the list):
- Allow user to select a date
- Perform a YQL query and iterate through the result to get all the races on that date
- Allow user to select a race from the above list
- Perform a YQL query and iterate through the result to get all the dogs in the race
- Perform a YQL query and iterate through the result to get all the races performed by each dog
- Calculate statistics based on the races performed by each dog
- Output everything
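For context, each "Perform a YQL query" step above boils down to a single HTTP call like the following. This is a minimal sketch: the racing-site URL, the XPath, and the shape of the result are illustrative placeholders, not the actual pages being scraped.

```php
<?php
// One YQL request against Yahoo's public endpoint, using the `html`
// table to pull rows out of a page. URL and XPath are placeholders.
$yql = 'select * from html where url="http://example-racing-site.com/meetings/2012-01-15"'
     . ' and xpath=\'//table[@class="races"]//tr\'';
$endpoint = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql) . '&format=json';

$response = file_get_contents($endpoint);
$data = json_decode($response, true);

// The result shape depends on the XPath; here each matched <tr> is a row.
if (!empty($data['query']['results']['tr'])) {
    foreach ($data['query']['results']['tr'] as $row) {
        // ... pull the race name, time, etc. out of $row ...
    }
}
```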
As you can see, there are quite a few separate HTTP requests. This is unavoidable, as each dataset lives on a different page of the racing website. For that reason, I would much rather get the bulk of the processing out of the way through a separate system and have the data stored in a database, as opposed to harvesting and processing it when the user requests it.
I could easily pull the extraction and calculation processing out of the current system and have it run from a crontab, but it would all be running from a single PHP request. That means the server would have to iterate over literally thousands of pieces of data, storing each set in a database, all within one PHP request. Having not tried it out, I would assume that the request would time out?
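For reference, the cron-driven variant would be an ordinary CLI script kicked off by a crontab entry; everything named below (file name, schedule, memory limit) is illustrative, a sketch rather than a tested job. One point worth noting in the comments: under the PHP CLI, `max_execution_time` defaults to 0, i.e. no limit.

```php
<?php
// scrape_races.php -- the harvesting/calculation job, run outside any
// web request via a crontab entry such as (path and timing illustrative):
//   0 1 * * * /usr/bin/php /path/to/scrape_races.php >> /var/log/scrape.log 2>&1
//
// Under the PHP CLI, max_execution_time defaults to 0 (no limit), so the
// usual web-request timeout does not apply; setting it explicitly guards
// against a stricter php.ini.
set_time_limit(0);
ini_set('memory_limit', '256M'); // raise if thousands of rows are held at once

$tomorrow = date('Y-m-d', strtotime('+1 day'));

// ... fetch every race for $tomorrow via YQL, then each dog, then each
//     dog's past races, writing each set to the database as it arrives
//     rather than accumulating everything in one array ...
```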
So to sum up, here are my questions:
- If I placed the processing into a single PHP file and ran it from cron, would it time out before finishing the job, or would it just continue to plough through?
- Is there any pre-existing library that deals with such a task?
- Any thoughts on alternative ways to accomplish this?
Many thanks,
Dan
Comments (1)
Instead of mass-crawling the site, how about on-demand caching?
This is probably easier to implement, and it won't raise suspicion on the racing site's end if their TOS doesn't allow crawling (it probably doesn't).
You just need a local SQL table keyed by date, with columns for the statistics you are already outputting.
Your flow would go something like:
- User selects a date
- Look that date up in the local table
- On a hit, output the stored statistics straight from the database
- On a miss, run the existing YQL harvesting and calculations, store the results against that date, then output them
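A minimal sketch of that check-then-fill flow, assuming PDO; the `race_stats` table, its columns, and `harvestAndCalculate()` are hypothetical stand-ins for the YQL and calculation code already written.

```php
<?php
// On-demand caching: serve statistics from a local table when present,
// and only scrape/calculate on a miss. Names below are illustrative.
//
// Assumed schema:
//   CREATE TABLE race_stats (
//       race_date DATE PRIMARY KEY,
//       wins      INT,
//       places    INT,
//       avg_time  FLOAT
//   );
function statsForDate(PDO $pdo, $date)
{
    $stmt = $pdo->prepare('SELECT * FROM race_stats WHERE race_date = ?');
    $stmt->execute(array($date));
    $cached = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($cached !== false) {
        return $cached; // cache hit: no HTTP calls at all
    }

    // Cache miss: run the existing YQL harvesting and calculations once...
    $stats = harvestAndCalculate($date);

    // ...then store the result so the next request for this date is instant.
    $insert = $pdo->prepare(
        'INSERT INTO race_stats (race_date, wins, places, avg_time)
         VALUES (?, ?, ?, ?)'
    );
    $insert->execute(array(
        $date, $stats['wins'], $stats['places'], $stats['avg_time'],
    ));

    return $stats;
}
```

This also spreads the scraping load across user requests instead of hammering the racing site in one nightly burst, and dates that nobody ever asks for are never fetched at all.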