PHP - Cron jobs for syncing data from an external API. How's my methodology?

Posted 2024-11-05 21:52:47


I was after some feedback on a PHP/MySQL-based web app that I'm in the process of developing. The app is a member-based site which uses a local database to store data for each user by day. This data comes from an external API and needs to be automatically synced daily so that my local DB has up-to-date data. This is the methodology I have in mind:

I have 2 Cron Jobs:

  1. The Queue Builder

  2. The Queue Worker

...and 3 database tables:

  1. User Data (stores whatever user data I have so far, if any).

  2. User Details (a list of all members which includes users that I don't have data for as yet, aka new signups).

  3. The Processing Queue
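To make the queue concrete, here's a rough sketch of how I picture The Processing Queue table. All table and column names are placeholders I made up, not final:

    <?php
    // Hypothetical schema for The Processing Queue, created via PDO.
    // Table and column names are placeholders for illustration only.
    $pdo = new PDO('mysql:host=localhost;dbname=myapp', 'dbuser', 'dbpass');

    $pdo->exec("
        CREATE TABLE IF NOT EXISTS processing_queue (
            id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            user_id    INT UNSIGNED NOT NULL,
            url        VARCHAR(500) NOT NULL,
            created_at TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
            -- Prefix unique key so the same URL can't be queued twice per user.
            UNIQUE KEY uniq_user_url (user_id, url(200))
        ) ENGINE=InnoDB
    ");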

The Queue Builder is a PHP script that will run via Cron at regular intervals. It will:

  • Compare the User Details and User Data tables to determine which new users I don't have any data for yet. For these users, it will build a list of URLs starting from 1/1/11 until the present day and insert them into The Processing Queue table (this is because I wish to have data from the start of the year for all my users).

  • Analyse the User Data table to find when each user's data was last synced, and build a list of URLs from the last synced date until the current day. These will also be inserted into The Processing Queue table.

This way, The Processing Queue table will contain a list of all the URLs that need to be queried.
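In rough PHP, I imagine the Queue Builder doing something like this. The API URL pattern is invented, the table names are the placeholders from above, and I'm assuming User Data is keyed by user_id and day:

    <?php
    // Queue Builder sketch. Table names, columns and the API URL pattern
    // are all assumptions for illustration, not the real API.
    $pdo = new PDO('mysql:host=localhost;dbname=myapp', 'dbuser', 'dbpass');

    // 1) Users we have no data for at all: backfill from 1 Jan 2011.
    $newUsers = $pdo->query("
        SELECT ud.user_id
        FROM   user_details ud
        LEFT JOIN user_data d ON d.user_id = ud.user_id
        WHERE  d.user_id IS NULL
    ")->fetchAll(PDO::FETCH_COLUMN);

    // 2) Users with data: resume from the day after their last sync.
    $staleUsers = $pdo->query("
        SELECT user_id, MAX(day) AS last_synced
        FROM   user_data
        GROUP  BY user_id
    ")->fetchAll(PDO::FETCH_ASSOC);

    // INSERT IGNORE + the unique key stop the same URL being queued twice.
    $insert = $pdo->prepare("
        INSERT IGNORE INTO processing_queue (user_id, url) VALUES (?, ?)
    ");

    $enqueueRange = function ($userId, DateTime $from) use ($insert) {
        $today = new DateTime('today');
        for ($day = clone $from; $day <= $today; $day->modify('+1 day')) {
            // Hypothetical per-day endpoint on the external API.
            $url = sprintf(
                'https://api.example.com/users/%d/data?date=%s',
                $userId,
                $day->format('Y-m-d')
            );
            $insert->execute([$userId, $url]);
        }
    };

    foreach ($newUsers as $userId) {
        $enqueueRange($userId, new DateTime('2011-01-01'));
    }
    foreach ($staleUsers as $row) {
        $from = new DateTime($row['last_synced']);
        $from->modify('+1 day');
        $enqueueRange($row['user_id'], $from);
    }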

The Queue Worker is also a PHP Cron script that will:

  • Select the first 20 items in The Processing Queue table, fetch their contents using PHP's curl_multi functions, error-check, and then delete those 20 rows from the table (a sketch follows below). I'm breaking it up into 20 URLs at a time because if I process too many URLs the script may hang, or my host may come knocking on my door equipped with a shotgun.
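Here's roughly how I picture that worker, assuming the placeholder processing_queue table from above. saveUserData() is a stub standing in for whatever parses the API response and writes it into User Data:

    <?php
    // Queue Worker sketch using curl_multi, batching 20 URLs at a time.
    $pdo = new PDO('mysql:host=localhost;dbname=myapp', 'dbuser', 'dbpass');

    // Stub: parse $body and write it into the user_data table.
    function saveUserData($userId, $body) { /* ... */ }

    // Grab the oldest 20 queued URLs.
    $rows = $pdo->query("
        SELECT id, user_id, url
        FROM   processing_queue
        ORDER  BY id
        LIMIT  20
    ")->fetchAll(PDO::FETCH_ASSOC);

    if (!$rows) {
        exit; // Nothing to do this run.
    }

    // Fire the whole batch in parallel.
    $mh = curl_multi_init();
    $handles = [];
    foreach ($rows as $row) {
        $ch = curl_init($row['url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$row['id']] = ['ch' => $ch, 'row' => $row];
    }

    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // Avoid busy-waiting while transfers run.
    } while ($running > 0);

    $delete = $pdo->prepare("DELETE FROM processing_queue WHERE id = ?");
    foreach ($handles as $id => $h) {
        $body = curl_multi_getcontent($h['ch']);
        $code = curl_getinfo($h['ch'], CURLINFO_HTTP_CODE);
        if ($code === 200 && $body !== '') {
            saveUserData($h['row']['user_id'], $body);
            $delete->execute([$id]); // Only dequeue on success; failures retry.
        }
        curl_multi_remove_handle($mh, $h['ch']);
        curl_close($h['ch']);
    }
    curl_multi_close($mh);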

The Queue Worker will also run regularly via a Cron job, so the idea is that data-syncing should be automated and users should always have up-to-date data. My questions are:

  1. What are the general thoughts on my methodology? Are there any side effects to doing it this way? I'm a hobbyist developer without a CS background so always keen on gaining criticism and learning about best practices! =)

  2. When a new user signs up, I plan on showing them a "your data can take xx minutes to sync" message while redirecting them to Getting Started resources etc. This is probably okay for my initial release, but further down the track I'd like to refine it so users get an email notification when syncing is done, or can see a % progress. Does my current solution accommodate this easily? Or will I have headaches down the track? (A rough progress sketch follows below.)
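My thinking is that progress could fall out of the queue itself: count the rows still queued for a user against the number originally enqueued. A sketch, assuming I record a queued_total for each user at signup (that column is an invention of mine):

    <?php
    // Hypothetical progress check. Assumes user_details.queued_total was
    // set to the number of URLs enqueued when the user signed up.
    function syncProgress(PDO $pdo, $userId) {
        $stmt = $pdo->prepare("
            SELECT ud.queued_total,
                   (SELECT COUNT(*) FROM processing_queue q
                    WHERE q.user_id = ud.user_id) AS remaining
            FROM user_details ud
            WHERE ud.user_id = ?
        ");
        $stmt->execute([$userId]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        if (!$row || $row['queued_total'] == 0) {
            return 100; // Nothing was queued, so the user is fully synced.
        }
        $done = $row['queued_total'] - $row['remaining'];
        return (int) floor(100 * $done / $row['queued_total']);
    }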

Opinions are appreciated! Many, MANY thanks in advance - I hope I have explained this clearly!


Comments (1)

小姐丶请自重 · 2024-11-12 21:52:47


Probably the best advice I can give you is this: KISS!! No, I'm not being over-affectionate; this stands for "Keep it simple, stupid!" and is arguably a very important engineering principle. With this in mind, the first question I'd ask is "why cron?" Would it be possible to run all of these tasks in real time when users sign up? If yes, I'd say go with that for now and don't bother with cron. If you do decide to go the cron route, I'd recommend the following:

  • Consider using a lock file to prevent multiple instances of your script from running at the same time (see the sketch after this list). For example, if you run the script every 5 minutes but each run takes 10 minutes to complete, the overlapping instances could interfere with each other.
  • Using curl multi will probably put more strain on your target server than making single requests in a loop. If you want to be polite to the target server, it's probably best to make single requests with a short sleep between each one.
  • If you only process 20 jobs at a time and your service is very popular, you could end up with a permanently growing work queue. For example, if you're adding 40 tasks an hour but only processing 20 tasks an hour, you'll never reach the end of the queue and it will never empty.
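For the lock file, something as simple as flock() should do. A minimal sketch (the lock path is arbitrary, and the sequential fetch shows the "polite" alternative to curl multi):

    <?php
    // Lock-file sketch: a second instance started while the first is still
    // running exits immediately. The lock path is arbitrary.
    $lock = fopen('/tmp/queue_worker.lock', 'c');
    if (!$lock || !flock($lock, LOCK_EX | LOCK_NB)) {
        exit; // Another instance holds the lock.
    }

    $urls = []; // ... pull a batch of URLs from the queue into $urls here ...

    // Polite alternative to curl multi: one request at a time with a pause.
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        curl_close($ch);
        // ... error-check and store $body ...
        sleep(1); // Brief pause between requests to ease load on the API.
    }

    flock($lock, LOCK_UN);
    fclose($lock);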

HTH.
