Scrape part of a website and get notified of changes

Posted on 2024-12-25 22:15:28

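Unfortunately, my university's website doesn't offer a feed, but they keep posting information that matters to me (deadlines, exam dates, etc.) as links to PDFs in one section of the site.

How can I periodically scrape that section of the site and get notified of changes (Growl, mail, or something similar)?

Normally I would use wget to mirror it, but how do I extract only part of the site? Is there a CLI tool that can extract XHTML via XPath or something similar?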

你又不是我 2025-01-01 22:15:28

Try this:

wget --spider --server-response http://example.com

This prints the response headers, which may include a "Content-Length" header (wget reports it as "Length"). If that value changes, you can notify yourself.
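As a rough illustration of the comparison step, here is a minimal Node.js sketch that assumes you have captured the text printed by two runs of the wget command above (wget writes its log to stderr). The function names parseHeaders and contentLengthChanged are made up for this example. Keep in mind that a server may omit Content-Length entirely, and that an edit which leaves the byte count unchanged would go unnoticed.

```javascript
// Parse "Name: value" lines out of captured `wget --server-response` text.
// Header names are lower-cased so lookups are case-insensitive.
function parseHeaders(wgetOutput) {
  const headers = {};
  for (const line of wgetOutput.split(/\r?\n/)) {
    const match = /^\s*([A-Za-z][A-Za-z0-9-]*):\s*(.*)$/.exec(line);
    if (match) headers[match[1].toLowerCase()] = match[2].trim();
  }
  return headers;
}

// Returns true/false when both captures carry a Content-Length header,
// or null when the header is missing and nothing can be concluded.
function contentLengthChanged(previousOutput, currentOutput) {
  const before = parseHeaders(previousOutput)["content-length"];
  const now = parseHeaders(currentOutput)["content-length"];
  if (before === undefined || now === undefined) return null;
  return before !== now;
}

// Hypothetical captured output from two runs (sample data, not real logs).
const before = [
  "  HTTP/1.1 200 OK",
  "  Content-Type: text/html",
  "  Content-Length: 1256",
].join("\n");
const after = before.replace("1256", "1310");

console.log(contentLengthChanged(before, after)); // true
```

Returning null for a missing header keeps "unknown" separate from "unchanged", so a caller can fall back to downloading and comparing the page itself.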

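Edit: If it does change, you can download the whole HTML file and grep it for a PDF link or whatever else you are looking for (maybe something like "<div id='news'>(.*?)</div>"; note that the lazy ".*?" needs a Perl-compatible regex engine, e.g. grep -P in GNU grep).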
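The extraction idea from the edit can be sketched in Node.js roughly as follows. This assumes the page has already been downloaded into a string and that the interesting section is a div with id "news", as in the regex above; the sample markup and the function names extractSection and extractPdfLinks are invented for illustration. A regex like this stops at the first closing div, so it breaks on nested divs; a real HTML parser is more robust.

```javascript
// Return the inner HTML of the first <div id="..."> with the given id,
// or null if there is none. Uses [\s\S] so the match can span lines.
function extractSection(html, id) {
  const escaped = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    "<div\\s+id=['\"]" + escaped + "['\"][^>]*>([\\s\\S]*?)</div>",
    "i"
  );
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

// Collect every href value ending in .pdf from an HTML fragment.
function extractPdfLinks(fragment) {
  const links = [];
  const pattern = /href\s*=\s*["']([^"']+\.pdf)["']/gi;
  let match;
  while ((match = pattern.exec(fragment)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Hypothetical page content standing in for the downloaded file.
const sampleHtml = `
<html><body>
<div id='nav'><a href="/about.html">About</a></div>
<div id='news'>
  <a href="/files/exam-dates.pdf">Exam dates</a>
  <a href="/files/deadlines.pdf">Deadlines</a>
  <a href="/contact.html">Contact</a>
</div>
</body></html>`;

const news = extractSection(sampleHtml, "news");
console.log(extractPdfLinks(news)); // logs the two .pdf paths
```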

心碎无痕… 2025-01-01 22:15:28

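Hmm... you should take a look at QueryPath. QueryPath makes parsing HTML easy. What if the HTML structure changes? What if you want a specific element of the page? QueryPath does the heavy lifting for you. Do you like jQuery? QueryPath is like jQuery for PHP.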

See: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP
See: http://querypath.org/

耳根太软 2025-01-01 22:15:28

You might be interested in looking at Pjscrape (disclaimer: this is my project). It's a web-scraping tool built on PhantomJS that gives you full jQuery access to the page in a headless WebKit browser context. It makes it very easy to pull semi-structured data from webpages via the command line, particularly if the page you're scraping has a consistent structure for new elements.

For example, you can pull all the course titles from this course catalog with the following code:

pjs.addScraper(
    // the page you're scraping
    'http://www.ischool.berkeley.edu/courses/catalog', 
    // selector for elements you want to pull text from
    '.views-row .views-field-title'
);

// suppress STDOUT logging
pjs.config('log', 'none');

Running this from the command line gives you JSON to STDOUT by default:

~> phantomjs /path/to/pjscrape.js my_script.js
["W10. Introduction to Information","24. Freshman Seminar", ...]

So it would be pretty simple to run this script on a regular basis, capture the output in a file, and then alert you when the new output doesn't match the previous scrape. You can also write your own scraper functions, so there's a lot of flexibility for more complex scraping if a simple selector won't do the trick.
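The "compare with the previous scrape" step could look something like the Node.js sketch below. It is not part of Pjscrape; it assumes the scraper's JSON array has been captured as a string, and the names diffItems, checkForChanges and the state file are hypothetical. On the first run there is no previous scrape, so every item is reported as added.

```javascript
const fs = require("fs");

// Compare two lists of scraped strings and report what appeared or vanished.
function diffItems(previous, current) {
  const before = new Set(previous);
  const now = new Set(current);
  return {
    added: current.filter((item) => !before.has(item)),
    removed: previous.filter((item) => !now.has(item)),
  };
}

// Load the previous scrape from stateFile (if any), store the new one,
// and return the difference. scraperOutput is the JSON text from STDOUT.
function checkForChanges(stateFile, scraperOutput) {
  const current = JSON.parse(scraperOutput);
  const previous = fs.existsSync(stateFile)
    ? JSON.parse(fs.readFileSync(stateFile, "utf8"))
    : [];
  fs.writeFileSync(stateFile, JSON.stringify(current));
  return diffItems(previous, current);
}

// In-memory demo using the titles from the example output above.
const changes = diffItems(
  ["W10. Introduction to Information"],
  ["W10. Introduction to Information", "24. Freshman Seminar"]
);
console.log(changes.added); // the one title that is new
```

A cron job could run the scraper, pass its output to checkForChanges, and send a mail or desktop notification whenever added is non-empty.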
