Automatically scraping data with cron
I am currently working on an application that harvests information about greyhounds from a racing website and performs a number of calculations on the given data. The current application manages to display the data correctly and perform the correct calculations by issuing individual YQL requests against the racing website based on the user's input.
However, I've found that due to the large number of HTTP calls and the lack of data caching, the application tends to be a bit slow. To speed it up and open up the ability to analyse the data further, I would like to build some sort of system that scrapes and stores all the data relevant to a given day on the night before, via a crontab. However, I'm unsure how to go about it.
At the moment, the application goes through the following rough process (one such YQL call is sketched after the list):
- Allow user to select a date
- Perform a YQL query and iterate through the result to get all the races on that date
- Allow user to select a race from the above list
- Perform a YQL query and iterate through the result to get all the dogs in the race
- Perform a YQL query and iterate through the result to get all the races performed by each dog
- Calculate statistics based on the races performed by each dog
- Output everything
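For context, each "Perform a YQL query" step above boils down to a single HTTP call like the following. This is a minimal sketch: the racing-site URL, the XPath, and the shape of the result are illustrative placeholders, not the actual pages being scraped.

```php
<?php
// One YQL request against Yahoo's public endpoint, using the `html`
// table to pull rows out of a page. URL and XPath are placeholders.
$yql = 'select * from html where url="http://example-racing-site.com/meetings/2012-01-15"'
     . ' and xpath=\'//table[@class="races"]//tr\'';
$endpoint = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql) . '&format=json';

$response = file_get_contents($endpoint);
$data = json_decode($response, true);

// The result shape depends on the XPath; here each matched <tr> is a row.
if (!empty($data['query']['results']['tr'])) {
    foreach ($data['query']['results']['tr'] as $row) {
        // ... pull the race name, time, etc. out of $row ...
    }
}
```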
As you can see, there are quite a few separate HTTP requests. This is unavoidable, as each dataset lives on a different page of the racing website. For that reason, I would much rather get the bulk of the processing out of the way through a separate system and have the data stored in a database, as opposed to harvesting and processing it when the user requests it.
I could easily pull the extraction and calculation processing out of the current system and have it run from a crontab, but it would all be running from a single PHP request. That means the server would have to iterate over literally thousands of pieces of data, storing each set in a database, all within one PHP request. Having not tried it out, I would assume that the request would time out?
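For reference, the cron-driven variant would be an ordinary CLI script kicked off by a crontab entry; everything named below (file name, schedule, memory limit) is illustrative, a sketch rather than a tested job. One point worth noting in the comments: under the PHP CLI, `max_execution_time` defaults to 0, i.e. no limit.

```php
<?php
// scrape_races.php -- the harvesting/calculation job, run outside any
// web request via a crontab entry such as (path and timing illustrative):
//   0 1 * * * /usr/bin/php /path/to/scrape_races.php >> /var/log/scrape.log 2>&1
//
// Under the PHP CLI, max_execution_time defaults to 0 (no limit), so the
// usual web-request timeout does not apply; setting it explicitly guards
// against a stricter php.ini.
set_time_limit(0);
ini_set('memory_limit', '256M'); // raise if thousands of rows are held at once

$tomorrow = date('Y-m-d', strtotime('+1 day'));

// ... fetch every race for $tomorrow via YQL, then each dog, then each
//     dog's past races, writing each set to the database as it arrives
//     rather than accumulating everything in one array ...
```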
So to sum up, here are my questions:
- If I placed the processing into a single PHP file and ran it from cron, would it time out before finishing the job, or would it just continue to plough through?
- Is there any pre-existing library that deals with such a task?
- Any thoughts on alternative ways to accomplish this?
Many thanks,
Dan
Comments (1)
Instead of mass-crawling the site, how about on-demand caching?
This is probably easier to implement, and it won't raise suspicion on the racing site's end if their TOS doesn't allow crawling (it probably doesn't).
You just need a local SQL table keyed by date, with columns for the statistics you are already outputting.
Your flow would go something like:
- User selects a date
- Look that date up in the local table
- On a hit, output the stored statistics straight from the database
- On a miss, run the existing YQL harvesting and calculations, store the results against that date, then output them
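A minimal sketch of that check-then-fill flow, assuming PDO; the `race_stats` table, its columns, and `harvestAndCalculate()` are hypothetical stand-ins for the YQL and calculation code already written.

```php
<?php
// On-demand caching: serve statistics from a local table when present,
// and only scrape/calculate on a miss. Names below are illustrative.
//
// Assumed schema:
//   CREATE TABLE race_stats (
//       race_date DATE PRIMARY KEY,
//       wins      INT,
//       places    INT,
//       avg_time  FLOAT
//   );
function statsForDate(PDO $pdo, $date)
{
    $stmt = $pdo->prepare('SELECT * FROM race_stats WHERE race_date = ?');
    $stmt->execute(array($date));
    $cached = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($cached !== false) {
        return $cached; // cache hit: no HTTP calls at all
    }

    // Cache miss: run the existing YQL harvesting and calculations once...
    $stats = harvestAndCalculate($date);

    // ...then store the result so the next request for this date is instant.
    $insert = $pdo->prepare(
        'INSERT INTO race_stats (race_date, wins, places, avg_time)
         VALUES (?, ?, ?, ?)'
    );
    $insert->execute(array(
        $date, $stats['wins'], $stats['places'], $stats['avg_time'],
    ));

    return $stats;
}
```

This also spreads the scraping load across user requests instead of hammering the racing site in one nightly burst, and dates that nobody ever asks for are never fetched at all.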