Scraping a non-RSS page to generate a feed

Posted 2024-08-21 16:42:21, 300 words, 4 views, 0 comments

I want to scrape a page that regularly updates (adding new articles with exactly the same structure as previous ones) in order to generate an RSS feed.

I can write the code to analyse the page easily, but how do I emulate a ping, i.e. how can my PHP script know when the page updates? Does it have to be a cron job?

(I know this is probably a duplicate question, but I searched for a direct answer with no luck. The closest I got was "Scrape and generate RSS feed", which has a scraping script but no info on how to get it to respond to changes on the page automatically.)

Comments (3)

深爱成瘾 2024-08-28 16:42:21

Depending on the system it may or may not be easy to tell when the page was updated last.

To check for changes, you can look at the page's Last-Modified HTTP response header. Not all systems update this header properly, so it may not be useful. It's also possible that an unmodified page will return a status of 304 (Not Modified), particularly if you provide an If-Modified-Since header in your request.
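The conditional request described above can be sketched as follows. This is a minimal illustration in Python using only the standard library (the asker's script is PHP, so treat it as a description of the logic rather than drop-in code). The function names and the example URL are assumptions, not part of the original answer.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def build_conditional_request(url: str, last_modified: Optional[str]) -> urllib.request.Request:
    """Build a GET request that asks the server to skip the body if nothing changed."""
    request = urllib.request.Request(url)
    if last_modified:
        # Value is the Last-Modified string saved from the previous successful fetch.
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_modified(
    url: str, last_modified: Optional[str], timeout: float = 10.0
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new_last_modified), or (None, last_modified) on a 304 response."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not Modified: keep the stored timestamp and skip re-parsing.
            return None, last_modified
        raise
```

A caller would store the returned Last-Modified value between runs and re-scrape only when the body is not `None`. If the server never sends Last-Modified, every call returns the full body, so the caller still needs a content-based comparison as a fallback.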

I would definitely run something like this as a cron job. While it might be possible to do it on demand using just the headers, whenever an update is needed your user will be waiting a long time (in relative terms) for your server to go out, get the page, do the processing, and send the response. I would be surprised if you didn't run into timeouts from time to time with a non-cron-based approach.
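One way to keep feed readers from ever waiting on the scrape is to have the cron job write a static feed file that the web server simply serves. The sketch below assumes the scraper has already produced a list of items with `title` and `link` keys. The function names, file paths, and the 15-minute schedule in the comment are all hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical crontab entry that runs the updater every 15 minutes:
#   */15 * * * * /usr/bin/python3 /path/to/update_feed.py


def build_rss(title: str, link: str, items: List[Dict[str, str]]) -> bytes:
    """Serialise scraped items into a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Feed generated by scraping " + link
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


def write_atomically(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.chmod(temp_path, 0o644)  # mkstemp creates the file owner-only
        os.replace(temp_path, path)  # readers see the old or new file, never a partial one
    except BaseException:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
```

With this layout a slow or failed scrape only delays the next refresh; requests for the feed keep getting the last good file.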

心奴独伤 2024-08-28 16:42:21

You could have a cron job that checks whether the site has been updated (either by checking the Last-Modified header, if available, or by checking the content you are interested in).

If the cron job detects a change in content when it checks the site, it could append a message to a queue (something like Zend_Queue, http://framework.zend.com/manual/en/zend.queue.example.html, for example). You could then have a worker that just works through the messages until either a time / data limit has been reached or the queue is empty.
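The checker-plus-worker arrangement might look roughly like this. The sketch uses a small SQLite table as a stand-in for a real queue such as Zend_Queue, and interprets the "data limit" as a cap on the number of messages handled per run. Both choices, and all names below, are assumptions made for illustration.

```python
import sqlite3
import time
from typing import Callable


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and create if needed) a table used as a simple FIFO queue."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    connection.commit()
    return connection


def enqueue(connection: sqlite3.Connection, body: str) -> None:
    """Called by the cron check when it detects that the page changed."""
    connection.execute("INSERT INTO messages (body) VALUES (?)", (body,))
    connection.commit()


def run_worker(
    connection: sqlite3.Connection,
    handle: Callable[[str], None],
    max_messages: int = 100,
    max_seconds: float = 30.0,
) -> int:
    """Process messages oldest-first until a limit is hit or the queue is empty."""
    deadline = time.monotonic() + max_seconds
    processed = 0
    while processed < max_messages and time.monotonic() < deadline:
        row = connection.execute(
            "SELECT id, body FROM messages ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break  # queue is empty
        message_id, body = row
        handle(body)  # e.g. re-scrape the URL in the message and rebuild the feed
        connection.execute("DELETE FROM messages WHERE id = ?", (message_id,))
        connection.commit()
        processed += 1
    return processed
```

A message is deleted only after its handler returns, so a crash mid-scrape leaves it queued for the next run. This sketch assumes a single worker; several concurrent workers would need some form of locking or claiming.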

赠意 2024-08-28 16:42:21

If there is no Last-Modified line, you could also check the response to a HEAD request for the presence and value of the ETag and Content-Length lines. If either differs from the prior value (which you've stored), then the content has likely changed. You could add to those any other response header lines that would indicate a change.
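A rough sketch of that header comparison, again in standard-library Python for illustration only. The set of watched headers, the function names, and the decision to treat "no usable headers" as "changed" are assumptions. The latter forces a fallback to fetching and comparing the content itself.

```python
import urllib.request
from typing import Dict, Mapping, Optional

# Response headers whose values are stored between checks.
WATCHED_HEADERS = ("Last-Modified", "ETag", "Content-Length")


def fetch_head_fingerprint(url: str, timeout: float = 10.0) -> Dict[str, Optional[str]]:
    """Issue a HEAD request and collect the watched headers (None when absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name: response.headers.get(name) for name in WATCHED_HEADERS}


def has_probably_changed(
    previous: Mapping[str, Optional[str]], current: Mapping[str, Optional[str]]
) -> bool:
    """Compare the stored fingerprint with a fresh one."""
    present = [
        name
        for name in WATCHED_HEADERS
        if previous.get(name) is not None or current.get(name) is not None
    ]
    if not present:
        # The server exposes none of the headers: no evidence either way,
        # so report a change and let the caller compare the page content.
        return True
    return any(previous.get(name) != current.get(name) for name in present)
```

Matching headers only suggest the page is unchanged. A dynamic page can keep the same Content-Length while its text differs, so a periodic full comparison is still a reasonable safeguard.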
