Cron job periodicity and workload

Posted 2024-11-26 17:08:47


I am working on a blog-aggregation project. One of the main tasks is fetching the RSS feeds of the blogs and processing them. I currently have about 500 blogs, but the number will increase steadily over time (it should reach thousands soon).

Currently (still in beta), I have a cron job that fetches all the RSS feeds once a day. But this concentrates all of the processing and network IO into a single daily run.

Should I:

  1. Keep the current setup (fetch everything at once)
  2. Fetch number_of_blogs / 24 feeds every hour (constant cron timing)
  3. Shorten the cron period so each run fetches a constant number of feeds (e.g. 10 blogs per run; see the sketch below)

Or are there any other ideas?

I am on shared hosting, so anything that reduces CPU and network IO is much appreciated :)
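For reference, here is a minimal sketch of option 3 (a fixed-size batch per cron run), assuming the project is in PHP with a hypothetical `blogs` table holding each feed URL and a `last_fetched_at` timestamp; the batch size, table and column names are illustrative only:

```php
<?php
// Sketch of option 3: run the cron job often, but fetch only a small batch of
// the least-recently-updated feeds each time. Table/column names are assumed.
// Crontab entry for a run every 30 minutes might look like:
//   */30 * * * * php /path/to/fetch_batch.php

$batchSize = 10;
$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

// Pick the feeds that have waited the longest since their last fetch.
$blogs = $pdo->query(
    'SELECT id, feed_url FROM blogs ORDER BY last_fetched_at ASC LIMIT ' . (int) $batchSize
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare('UPDATE blogs SET last_fetched_at = NOW() WHERE id = :id');

foreach ($blogs as $blog) {
    $xml = @file_get_contents($blog['feed_url']); // network IO for one feed
    if ($xml !== false) {
        // ... parse the RSS and store any new posts here ...
    }
    $update->execute([':id' => $blog['id']]);
}
```

Each run touches only 10 feeds, so the per-run load stays flat as more blogs are added; the cron interval just needs to be short enough that every feed is revisited within the desired period (500 blogs at 10 per run every 30 minutes is roughly one full pass per day).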


Comments (3)

晨与橙与城 2024-12-03 17:08:47


I have used a system that adapts the update frequency of each feed, described in this answer.

You can save resources by using conditional HTTP GETs to retrieve the feeds that support them. Keep the values of the Last-Modified and ETag headers from the HTTP response, and on the next attempt supply them in the If-Modified-Since and If-None-Match request headers.

If you then receive an HTTP 304 response code, you know the feed hasn't changed. In that case the complete feed isn't sent again; only the headers come back, telling you there are no new posts. This reduces both bandwidth and data processing.
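A minimal sketch of this conditional-GET idea, assuming PHP with cURL; the fetch_feed_conditionally() helper, the $feed array, and how the stored etag / last_modified values are persisted between runs are assumptions for illustration, not part of the answer:

```php
<?php
// Conditional HTTP GET: send the stored ETag / Last-Modified values and skip
// all parsing when the server answers 304 Not Modified.
function fetch_feed_conditionally(array &$feed): ?string
{
    $ch = curl_init($feed['url']);

    $requestHeaders = [];
    if (!empty($feed['etag'])) {
        $requestHeaders[] = 'If-None-Match: ' . $feed['etag'];
    }
    if (!empty($feed['last_modified'])) {
        $requestHeaders[] = 'If-Modified-Since: ' . $feed['last_modified'];
    }

    $responseHeaders = [];
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 20,
        CURLOPT_HTTPHEADER     => $requestHeaders,
        CURLOPT_HEADERFUNCTION => function ($ch, $line) use (&$responseHeaders) {
            if (strpos($line, ':') !== false) {
                [$name, $value] = explode(':', $line, 2);
                $responseHeaders[strtolower(trim($name))] = trim($value);
            }
            return strlen($line);
        },
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status === 304 || $body === false) {
        return null; // 304: feed unchanged, nothing to download or parse
    }

    // Remember the validators for the next run so the server can answer 304.
    $feed['etag']          = $responseHeaders['etag'] ?? $feed['etag'];
    $feed['last_modified'] = $responseHeaders['last-modified'] ?? $feed['last_modified'];
    // ... persist $feed and parse $body here ...

    return $body;
}
```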

狼性发作 2024-12-03 17:08:47


I had a similar situation, but not with so many blogs :) I used to import them once every 24 hours, but to save CPU load I called sleep() after every blog, e.g. sleep(10);, and it kept me safe.
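A minimal sketch of that throttling idea in PHP; $blogUrls is a hypothetical array of feed URLs loaded from wherever the blogs are stored:

```php
<?php
// Still one pass per day, but with a pause after each blog so the work is
// spread out instead of hitting the shared host in a single burst.
foreach ($blogUrls as $url) {
    $xml = @file_get_contents($url);
    if ($xml !== false) {
        // ... parse and store new posts ...
    }
    sleep(10); // 500 blogs * 10 s is roughly 83 minutes of gentle, spread-out work
}
```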

戏舞 2024-12-03 17:08:47


I would consider using Google App Engine to retrieve and process the 'raw' information and have it POST the data out in manageable-size packets to the web server. GAE has its own cron job system and can run independently 24/7.

I'm currently using a similar system to retrieve job information from several websites and compile it for another one; it's a brilliant way to offset the bandwidth and processing requirements as well.
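A rough sketch of what the receiving end on the shared host could look like in PHP; the endpoint, the X-Api-Key shared secret, the JSON packet format, and the posts table are all assumptions, not part of the answer:

```php
<?php
// Endpoint on the shared host that accepts small JSON packets of already
// processed posts POSTed by the GAE fetcher, and simply stores them.
if ($_SERVER['REQUEST_METHOD'] !== 'POST'
    || ($_SERVER['HTTP_X_API_KEY'] ?? '') !== 'change-me-shared-secret') {
    http_response_code(403);
    exit;
}

$packet = json_decode(file_get_contents('php://input'), true);
if (!is_array($packet) || empty($packet['posts'])) {
    http_response_code(400);
    exit;
}

$pdo  = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT IGNORE INTO posts (blog_id, title, url, published_at)
     VALUES (:blog_id, :title, :url, :published_at)'
);
foreach ($packet['posts'] as $post) {
    $stmt->execute([
        ':blog_id'      => $post['blog_id'],
        ':title'        => $post['title'],
        ':url'          => $post['url'],
        ':published_at' => $post['published_at'],
    ]);
}
http_response_code(204); // stored; nothing to send back
```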
