Scrape a section of a website and get notified of changes
Unfortunately, my university's website does not provide feeds, but it keeps publishing information that is important to me (deadlines, exam dates, etc.) as links to PDFs in a certain section of the site.
How can I scrape that section of the site regularly and get notified of changes (Growl, mail, or something similar)?
Normally I would use wget to mirror it, but how can I extract only part of the page?
Is there a CLI tool that can extract the XHTML via XPath or similar?
Comments (3)
Try this:
This prints the response headers, which may include a "Length" value (the page's Content-Length). If it changes, you can notify yourself.
Edit: If it changes, you can download the whole HTML file and grep it for PDF links or whatever else you are looking for (for example, "<div id='news'>(.*?)</div>").
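The two steps described in this answer (check the headers, then pull out one block of the page) could look roughly like this in Python. This is only a sketch, not the command the answer originally showed: `NEWS_URL` is a made-up placeholder, the function names are invented for illustration, and the regex is the one suggested above, widened to accept either quote style.

```python
import re
import urllib.request

# Hypothetical URL; replace it with the page you want to watch.
NEWS_URL = "https://www.example.edu/news.html"

# Regex from the answer, accepting single or double quotes around the id.
NEWS_DIV = re.compile(r"<div id=['\"]news['\"]>(.*?)</div>", re.DOTALL)


def fetch_content_length(url):
    """Send a HEAD request and return the Content-Length header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.headers.get("Content-Length")


def page_changed(previous_length, current_length):
    """Treat a missing header as 'changed' so that no update is missed."""
    if previous_length is None or current_length is None:
        return True
    return previous_length != current_length


def extract_news(html):
    """Return the inner HTML of the first <div id='news'>, or None."""
    match = NEWS_DIV.search(html)
    return match.group(1).strip() if match else None
```

Two caveats: many servers omit Content-Length for dynamically generated pages, and the non-greedy regex stops at the first `</div>`, so it cuts the block short if the news div contains nested divs.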
Mmm... You should take a look at QueryPath. QueryPath makes it easy to parse HTML. What if the HTML structure changes? What if you want specific elements of the page? QueryPath does the hard work for you. Do you like jQuery? QueryPath is like jQuery for PHP.
See: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP
See: http://querypath.org/
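To illustrate what "specific elements of the page" means in practice, here is a rough sketch of the same idea using only Python's standard `html.parser`. It is not QueryPath's API (QueryPath is a PHP library); the class name, the function name, and the assumption that the PDF links sit inside an element with a known `id` are all mine.

```python
from html.parser import HTMLParser


class SectionPdfLinks(HTMLParser):
    """Collect href values ending in .pdf inside the element with a given id."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.container_tag = None  # tag name of the watched element once found
        self.depth = 0             # nesting depth of that tag name inside it
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.container_tag is None:
            # Still looking for the element that carries the requested id.
            if attributes.get("id") == self.section_id:
                self.container_tag = tag
                self.depth = 1
            return
        if self.depth == 0:
            return  # the watched element has already been closed
        if tag == self.container_tag:
            self.depth += 1
        href = attributes.get("href")
        if tag == "a" and href and href.lower().endswith(".pdf"):
            self.links.append(href)

    def handle_endtag(self, tag):
        # Only tags with the container's name affect the nesting depth.
        if self.depth > 0 and tag == self.container_tag:
            self.depth -= 1


def pdf_links_in_section(html, section_id):
    """Return the PDF links found inside the element whose id is section_id."""
    parser = SectionPdfLinks(section_id)
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike a regex, this keeps working when the section contains nested elements of the same type, because it counts opening and closing tags of the container's own tag name.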
You might be interested in looking at Pjscrape (disclaimer: this is my project). It's a web-scraping tool built on PhantomJS, giving you full jQuery access to the page in a headless WebKit browser context. It makes it very easy to pull semi-structured data from web pages via the command line, particularly if the page you're scraping has a consistent structure for new elements.
For example, you can pull all the course titles from this course catalog with the following code:
Running this from the command line gives you JSON to STDOUT by default:
So it would be pretty simple to run this script on a regular basis, capture the output in a file, and then alert you when the new output doesn't match the previous scrape. You can also write your own scraper functions, so there's a lot of flexibility for more complex scraping if a simple selector won't do the trick.
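The "compare with the previous scrape" step might be sketched like this in Python. It assumes the scraper's output has already been parsed into a list of strings (for example, PDF URLs or titles); the state file path and the function names are hypothetical and not part of Pjscrape.

```python
import json
from pathlib import Path


def load_previous(state_path):
    """Return the list saved by the last run, or an empty list on first run."""
    path = Path(state_path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def detect_new_items(current_items, state_path):
    """Compare with the last run, save the new snapshot, return unseen items."""
    previous = set(load_previous(state_path))
    added = [item for item in current_items if item not in previous]
    Path(state_path).write_text(json.dumps(current_items), encoding="utf-8")
    return added
```

Run from cron or a similar scheduler, a non-empty return value would be the cue to send yourself a mail or a desktop notification. On the very first run every item counts as new, since there is no earlier snapshot to compare against.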