Extracting part of a website's HTML with Python

Posted on 2024-12-18 02:41:16

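I'm currently working on a project that involves a Python program that inspects a web page's HTML. The program has to monitor a web page, and when the HTML changes it should carry out a set of actions. My questions are how to extract part of a web page, and how to monitor the page's HTML and report a change almost immediately after it happens. Thanks.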


梦里寻她 2024-12-25 02:41:16

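In the past I wrote my own parsers. Nowadays HTML is HTML5, with more statements, more JavaScript, and a lot of sloppy output produced by developers and their editors, such as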

document.write('<SCR' + 'IPT

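On top of that, some web frameworks (or badly written application code) change the Last-Modified value in the HTTP header on every request, even when the text a human reads on the page has not changed.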
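Because Last-Modified may change even when the visible text does not, one option is to ignore the header and compare a digest of the content itself. The following is only a minimal sketch: it assumes a hypothetical `fetch` callable that returns the page body as bytes, and the names `fingerprint` and `watch`, the polling interval, and the `max_polls` limit are illustrative choices, not something from the answer.

```python
import hashlib
import time
from typing import Callable, Iterator, Optional


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the page body."""
    return hashlib.sha256(body).hexdigest()


def watch(fetch: Callable[[], bytes], interval: float = 5.0,
          max_polls: Optional[int] = None) -> Iterator[bytes]:
    """Poll `fetch` and yield the body each time its digest changes.

    The first fetch only establishes the baseline and is not yielded.
    `max_polls=None` means poll forever.
    """
    last = fingerprint(fetch())
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        body = fetch()
        current = fingerprint(body)
        if current != last:
            last = current
            yield body  # the caller runs its actions on the new body
        polls += 1
```

A real `fetch` could be as small as `lambda: urllib.request.urlopen(url).read()`, where `url` is whatever page is being watched. Polling can only report a change as quickly as `interval` allows, and hashing the whole body still reacts to changes a reader would never notice, so in practice the digest is usually taken over a selected part of the page.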

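I suggest using BeautifulSoup for the parsing; you will still have to choose carefully for yourself what to watch in order to decide whether the web page has been modified.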
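As one possible way to "choose what to watch", the sketch below pulls a single fragment out of the page with BeautifulSoup and hashes only its visible text, so that changes elsewhere in the markup are ignored. It assumes the `beautifulsoup4` package is installed. The CSS selector `#content`, the sample HTML strings, and the function names are made up for illustration.

```python
import hashlib
from typing import Optional

from bs4 import BeautifulSoup


def extract_text(html: str, selector: str) -> Optional[str]:
    """Return the whitespace-normalized text of the first element that
    matches the CSS `selector`, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        return None
    return " ".join(node.get_text(" ").split())


def fragment_digest(html: str, selector: str) -> Optional[str]:
    """Hash only the text of the watched fragment."""
    text = extract_text(html, selector)
    if text is None:
        return None
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical snapshots of the same page at three points in time.
old = '<html><body><div id="ad">Ad 1</div><div id="content"><p>Price: 10</p></div></body></html>'
new_ad = '<html><body><div id="ad">Ad 2</div><div id="content"><p>Price:   10</p></div></body></html>'
new_price = '<html><body><div id="ad">Ad 2</div><div id="content"><p>Price: 12</p></div></body></html>'

# Only the third snapshot differs inside the watched fragment.
ad_changed = fragment_digest(old, "#content") != fragment_digest(new_ad, "#content")
price_changed = fragment_digest(old, "#content") != fragment_digest(new_price, "#content")
```

Here `ad_changed` is false and `price_changed` is true: the rotating ad and the extra whitespace are ignored, while the change in the watched text is detected. Which fragment to watch, and whether to compare text or raw markup, depends on the page.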

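Its intro:

BeautifulSoup is a Python package that parses broken HTML, just like
lxml supports it based on the parser of libxml2. BeautifulSoup uses a
different parsing approach. It is not a real HTML parser but uses
regular expressions to dive through tag soup. It is therefore more
forgiving in some cases and less good in others. It is not uncommon
that lxml/libxml2 parses and fixes broken HTML better, but
BeautifulSoup has superior support for encoding detection. It very
much depends on the input which parser works better.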
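Since which parser copes better depends on the input, it can help to make the parser a parameter. The introduction quoted above describes an older BeautifulSoup; current BeautifulSoup 4 releases hand the parsing to a pluggable parser such as Python's built-in `html.parser` or lxml. The sketch below assumes `beautifulsoup4` is installed and treats lxml as optional. The helper name `parse` and the preference order are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse(html: str, parsers=("lxml", "html.parser")) -> BeautifulSoup:
    """Parse `html` with the first parser in `parsers` that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser library is not available; try the next
    raise RuntimeError(f"none of these parsers is installed: {parsers}")


broken = "<p>one<p>two"  # unclosed tags, typical tag soup
soup = parse(broken)
paragraphs = soup.find_all("p")
```

Different parsers may repair the same broken markup into different trees, so it is worth checking the result on real samples of the page being watched before settling on one.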

