How do I know whether a scraped website has changed?

Posted on 2024-08-27 00:19:04

I'm using PHP to scrape a website and collect some data. It's all done without regular expressions: I use PHP's explode() function to find particular HTML tags instead.

If the structure of the website changes (CSS, HTML), the scraper may collect the wrong data. So the question is: how do I know whether the HTML structure has changed? How can I detect this before storing anything in my database, so that wrong data is never stored?

Comments (6)

多孤肩上扛 2024-09-03 00:19:04

I don't think there is a clean solution if you are scraping a page whose content changes.

I have developed several Python scrapers, and I know how frustrating it can be when a site makes just a subtle change to its layout.

You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you could isolate the content you need to extract (links?).

Another possible approach would be to code some constraints and check them before storing anything in the database.

For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for an integer ID, or for anything else you scrape that can be recognized as valid.
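The constraint idea can be sketched roughly as follows. This is Python rather than PHP, purely for illustration. The function names (`is_valid_url`, `is_valid_id`, `validate_record`) and the record layout with `url` and `id` string fields are assumptions, not something taken from the question.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Return True if value looks like an absolute http(s) URL."""
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_valid_id(value: str) -> bool:
    """Return True if value is a positive integer written with ASCII digits."""
    text = value.strip()
    return text.isascii() and text.isdigit() and int(text) > 0


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed.

    Assumes both fields hold strings, as they would straight out of a scraper.
    """
    problems = []
    if not is_valid_url(record.get("url", "")):
        problems.append("url is not a valid http(s) URL")
    if not is_valid_id(record.get("id", "")):
        problems.append("id is not a positive integer")
    return problems
```

A scraper would call `validate_record()` on each parsed item and skip the database write (and probably raise an alert) whenever the list is not empty. In PHP, `filter_var()` with `FILTER_VALIDATE_URL` or `FILTER_VALIDATE_INT` covers similar ground.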

If you are scraping plain text, it will be more difficult to check.

黯然#的苍凉 2024-09-03 00:19:04

It depends on the site, but you could count the page elements in the scraped page, such as div tags, class attributes, and style tags, and then compare these totals against those of later scrapes to detect whether the page structure has changed.

A similar process could be used for the CSS file: the name of each class or id could be extracted with a simple regex, stored, and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
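As a rough sketch of the CSS idea, in Python rather than PHP: the pattern and the `selector_names` helper below are illustrative assumptions. This is a heuristic, not a CSS parser. It only looks at the selector part of each rule, so hex colors such as `#fff` inside declarations are not mistaken for ids, but `@import` lines or attribute selectors containing dots can still produce false hits.

```python
import re

# A "." or "#" followed by an identifier. Deliberately simple.
SELECTOR_NAME = re.compile(r"([.#])(-?[A-Za-z_][\w-]*)")


def selector_names(css: str) -> set:
    """Collect class (.name) and id (#name) tokens from rule selectors."""
    names = set()
    # Remove comments so commented-out rules are not counted.
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)
    for rule in css.split("}"):
        # Keep the text before the last "{" so that selectors nested
        # inside an @media block are still seen.
        selector = rule.rsplit("{", 1)[0]
        for prefix, name in SELECTOR_NAME.findall(selector):
            names.add(prefix + name)
    return names


def compare_names(old_css: str, new_css: str) -> tuple:
    """Return (added, removed) class/id names between two stylesheets."""
    old, new = selector_names(old_css), selector_names(new_css)
    return new - old, old - new
```

The set returned by `selector_names()` for the first scrape would be stored, and a non-empty `added` or `removed` set on a later scrape would be the signal to inspect the page before trusting the parsed data.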

你的心境我的脸 2024-09-03 00:19:04

Speaking out of my ass here, but it's possible you might want to look at some of PHP's Document Object Model methods.

http://php.net/manual/en/book.dom.php

If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
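A minimal sketch of that idea, in Python rather than PHP and using only the standard library: the names `SkeletonBuilder` and `structure_fingerprint` are made up for illustration. It records each start tag with its nesting depth, ignores text and attributes, and hashes the result, so a content-only change leaves the fingerprint untouched. Pages with unclosed tags will get skewed depths, but the skew is the same on every scrape, so the comparison still works.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SkeletonBuilder(HTMLParser):
    """Record each start tag with its nesting depth; skip text and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append(f"{self.depth}:{tag}")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def structure_fingerprint(html: str) -> str:
    """Hash of the tag skeleton; identical for pages differing only in text."""
    builder = SkeletonBuilder()
    builder.feed(html)
    builder.close()
    skeleton = "\n".join(builder.lines)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()
```

Storing the fingerprint from a known-good scrape and comparing it on every later run gives a yes/no answer to "did the structure change?" without being triggered by ordinary content updates.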

(By the way, when I wanted an email notification as soon as the bar exam results were posted on a particular page, I simply compared file_get_contents() values. Surprisingly, it worked flawlessly: no false positives, and it emailed me as soon as the site posted the content.)

稚气少女 2024-09-03 00:19:04

If you want to know about changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.

There are a lot of ways you can do it:
a SAX parser
a DOM parser, etc.

I have a small blog post which gives some pointers to what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html

Or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or a DOM utility parser.

禾厶谷欠 2024-09-03 00:19:04

First, in some cases you may want to compare a hash of the original HTML to a hash of the new HTML. MD5 and SHA1 are two popular hashes. This will not be appropriate in every circumstance, but it is something you should be familiar with. It will tell you whether anything at all has changed: content, tags, or anything else.
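A hash comparison could look roughly like this in Python (the PHP equivalents would be `md5()` and `sha1()`). The helper names `page_hash` and `has_changed` are illustrative. Keep in mind that any byte-level difference, including a rotating ad or a timestamp in the page, produces a different digest.

```python
import hashlib


def page_hash(html: str, algorithm: str = "sha1") -> str:
    """Hex digest of the raw page; any byte-level change yields a new digest."""
    return hashlib.new(algorithm, html.encode("utf-8")).hexdigest()


def has_changed(previous_digest: str, html: str, algorithm: str = "sha1") -> bool:
    """Compare a stored digest against the digest of a freshly fetched page."""
    return page_hash(html, algorithm) != previous_digest
```

The digest from the last successful scrape would be stored next to the data, and `has_changed()` called with the newly fetched HTML before parsing.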

To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
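Here is one way the histogram comparison might be sketched, in Python with the standard-library parser; `TagCollector`, `tag_sequence`, `tag_histogram`, and `histogram_diff` are hypothetical names. The flat tag sequence is included as a cheap stand-in for the order check. A full tree comparison would also need to track nesting.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect start-tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    """Tag names in the order they open; sensitive to reordering."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_histogram(html: str) -> Counter:
    """How many times each tag occurs; insensitive to order."""
    return Counter(tag_sequence(html))


def histogram_diff(old: Counter, new: Counter) -> dict:
    """Return {tag: new_count - old_count} for tags whose counts differ."""
    return {tag: new[tag] - old[tag]
            for tag in sorted(set(old) | set(new))
            if new[tag] != old[tag]}
```

An empty `histogram_diff()` together with an unchanged `tag_sequence()` suggests the layout is intact. On list pages where the number of rows varies naturally, some tolerance on the counts would be needed.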

PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.

木森分化 2024-09-03 00:19:04

explode() is not an HTML parser, yet you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser; nothing else will be able to do this properly.
