Schedule sending HTTP requests to a particular site
I want some way to be notified whenever a new result appears for a search query on a particular site. The site does not provide any feature (RSS, alerts, etc.) for this. One way I think to accomplish this would be to send an HTTP request (for the search) and process the HTTP response, sending mail for any new result that comes up. The search parameters could be static or, better, taken from a source (like a CSV file). Does anyone know of an existing solution, preferably online, which can accomplish this?
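Roughly the idea, as a purely illustrative sketch (assuming a hypothetical queries.csv with site and query columns; none of the names here are from an actual tool):

```python
import csv
import urllib.parse
import urllib.request

# Hypothetical queries.csv with a header row: site,query
with open("queries.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Build the search URL for this row's parameters
        url = row["site"] + "?" + urllib.parse.urlencode({"q": row["query"]})
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # ...parse html for new results and mail anything not seen before
```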
Thanks,
Jeet
3 Answers
Try iHook; it allows you to schedule HTTP requests to public web resources (as frequently as every minute) and receive rule-based email notifications. You can create notification rules around the response status code and response body (via JSON expressions and CSS selectors).
That would depend on the particular site you want to query.
I know of no open-source solution "out of the box" to do this so I believe you'd need to write a custom spider/crawler to accomplish your task; it would need to provide the following services:
Scheduling - when the crawl should occur. Typically the 'cron' system service in Unix-like systems or the Task Scheduler in Windows are used.
Retrieval - retrieving targeted pages. Using either a scripting language like Perl or a dedicated system tool like 'curl' or 'wget'.
Extraction / Normalization - removing everything from the target (retrieved page) except the content of interest. Needed to compensate for changing sections of the target that are not germane to the task, like dates or advertising. Typically accomplished via a scripting language that supports regular expressions (for trivial cases) or an HTML parser library (for more specialized extractions).
Checksumming - converting the target into a unique identifier determined by its content. Used to determine changes to the target since the last crawl. Accomplished by a system tool (such as the Linux 'cksum' command) or a scripting language.
Change detection - comparing the previously saved checksum for the last retrieved target with the newly computed checksum for the current retrieval. Again, typically using a scripting language.
Alerting - informing users of identified changes. Typically via email or text message.
State management - storing target URIs, extraction rules, user preferences and target checksums from the previous run. Either configuration files or a database (like MySQL) can be used.
Please note that this list of services attempts to describe the system in the abstract, so it sounds a lot more complicated than the actual tool you create will be. I've written several systems like this before, so I expect a simple solution written in Perl (utilizing standard Perl modules) and running on Linux would require a hundred lines or so for a couple of target sites, depending on extraction complexity; a rough sketch of that structure is below.
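To make the shape of such a tool concrete, here is a minimal sketch in Python rather than Perl (the structure is the same), using only the standard library. The target URL, extraction pattern, state-file path, and mail settings are placeholder assumptions for illustration, not anything prescribed above; scheduling is left to cron or the Task Scheduler as described.

```python
#!/usr/bin/env python3
"""Minimal change-detection watcher: retrieve, extract, checksum, compare
against saved state, and alert by email. Scheduling is external, e.g. cron:
    */30 * * * * /usr/bin/python3 watch_site.py
"""
import hashlib
import json
import re
import smtplib
import urllib.request
from email.message import EmailMessage
from pathlib import Path

STATE_FILE = Path("watch_state.json")      # last checksum per URL (assumed path)
TARGETS = {
    # hypothetical target and extraction rule: keep only the result markup
    "https://example.com/search?q=term": r'<div class="result">.*?</div>',
}
MAIL_FROM = "watcher@example.com"           # placeholder mail settings
MAIL_TO = "me@example.com"
SMTP_HOST = "localhost"


def retrieve(url: str) -> str:
    """Retrieval: fetch the target page."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract(html: str, pattern: str) -> str:
    """Extraction/normalization: keep only the content of interest so that
    dates, ads, etc. elsewhere on the page do not trigger false alerts."""
    return "\n".join(re.findall(pattern, html, flags=re.S))


def checksum(text: str) -> str:
    """Checksumming: reduce the normalized content to a stable identifier."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def alert(url: str) -> None:
    """Alerting: notify the user of an identified change by email."""
    msg = EmailMessage()
    msg["Subject"] = f"Change detected: {url}"
    msg["From"], msg["To"] = MAIL_FROM, MAIL_TO
    msg.set_content(f"New or changed search results at {url}")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)


def main() -> None:
    # State management: load checksums from the previous run.
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    for url, pattern in TARGETS.items():
        digest = checksum(extract(retrieve(url), pattern))
        # Change detection: compare with the previously saved checksum.
        if state.get(url) != digest:
            alert(url)
            state[url] = digest
    STATE_FILE.write_text(json.dumps(state, indent=2))


if __name__ == "__main__":
    main()
```

The regular-expression extraction covers only the trivial case mentioned above; for anything less regular, swap extract() for a proper HTML parser. The TARGETS dict plus the JSON state file together stand in for the state-management service.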