How do I match an entire HTML element the way a browser does, no matter what is inside it?

Posted 2024-09-13 16:05:17

On a given page there are a bunch of elements like this:

<div class="some class"> <!-- anything can go here: other divs, even other divs with
the same class, but I need to match right up to the closing tag for this particular
opening tag --></div>

Comments (4)

半枫 2024-09-20 16:05:17

Regular expressions describe operations on regular languages. HTML is not a regular language. I'd be prepared to bet you could do it with a so-called "recursive regular expression", as those aren't really regular expressions and aren't limited to regular languages. I'd bet even more that you'd be better off parsing it instead anyway.

The easiest approach (not the best, but the easiest to code in a few lines) is to keep a count of inner divs. Whenever you encounter an opening div tag, increment the count. Whenever you encounter a closing div tag, decrement the count if it's non-zero; if it's already zero, you've found the end of your complete element. If you reach the end of the file first, somebody hasn't closed their divs properly.
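
As a rough illustration of the counting approach, here is a minimal Python sketch. The function name `find_div_end` and the pattern `DIV_TAG` are made up for this example, and the tag matching is deliberately naive: it assumes that no `<div` or `</div` text appears inside comments, scripts, or attribute values.

```python
import re

# Matches an opening <div ...> or a closing </div>. Naive on purpose:
# it knows nothing about comments, scripts, or '>' inside attribute values.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def find_div_end(html, start):
    """Return the index just past the </div> that closes the <div> at `start`."""
    opening = DIV_TAG.match(html, start)
    if opening is None or opening.group(1):
        raise ValueError("no opening <div> at the given offset")

    depth = 0  # number of inner divs that are currently open
    for tag in DIV_TAG.finditer(html, opening.end()):
        if not tag.group(1):   # an inner opening tag
            depth += 1
        elif depth:            # a closing tag that belongs to an inner div
            depth -= 1
        else:                  # the closing tag for our own div
            return tag.end()

    # End of input reached before the element was closed.
    raise ValueError("reached end of input with an unclosed <div>")
```

Slicing `html[start:find_div_end(html, start)]` then gives the whole element, inner divs included.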

Using an XML parser is easier still, if you can either depend on the code being well-formed (if you can't, you've got two problems...) or are prepared to just raise an error on non-well-formed input.
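
A sketch of the XML-parser route, assuming the page really is well-formed XML (for example XHTML). It uses Python's standard `xml.etree.ElementTree`; the helper name `outer_xml_of_divs` is hypothetical. Note that the returned strings are re-serialized from the tree, so they may differ from the original bytes in details such as quoting or entity spelling.

```python
import xml.etree.ElementTree as ET


def outer_xml_of_divs(markup, class_name):
    """Return the serialized form of every <div> whose class equals class_name."""
    # fromstring() raises ET.ParseError if the input is not well-formed.
    root = ET.fromstring(markup)
    results = []
    for div in root.iter("div"):
        if div.get("class") == class_name:
            # tostring() also emits the element's tail (the text that follows
            # its closing tag), so hide the tail while serializing.
            tail, div.tail = div.tail, None
            results.append(ET.tostring(div, encoding="unicode"))
            div.tail = tail
    return results
```

Because the parser builds a tree, nested divs with the same class need no special handling: each one is returned as its own complete element.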

深巷少女 2024-09-20 16:05:17

The only robust solution is to parse the HTML; regexps can't solve this in all cases.

In fact browsers are often very tolerant; they even cope with errors such as missing </p> tags. So dealing with arbitrary pages is actually quite tricky.

If you are dealing with a page that you produce yourself, then perhaps you can code some special-case regexps. Otherwise you may need to seek out a true parser such as this. (I've never used it myself, but it may well be what you need.)
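
For pages that are not well-formed, a tolerant tokenizer can do the tag tracking instead of a regexp. The sketch below uses Python's standard `html.parser` module, which is not the parser linked above; the class `DivSpanFinder` and its attributes are hypothetical names. It compares the `class` attribute as an exact string, and it does not apply a browser's error-recovery rules, so on badly nested markup its result may differ from what a browser builds.

```python
from html.parser import HTMLParser


class DivSpanFinder(HTMLParser):
    """Record (start, end) offsets of every <div> whose class equals class_name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.open_divs = []  # stack: start offset for wanted divs, None otherwise
        self.spans = []

    def find(self, source):
        self.source = source
        # Offsets at which each line begins, used to turn getpos() into an index.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.feed(source)
        self.close()
        return sorted(self.spans)

    def _offset(self):
        line, column = self.getpos()  # position where the current tag starts
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            wanted = dict(attrs).get("class") == self.class_name
            self.open_divs.append(self._offset() if wanted else None)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            start = self.open_divs.pop()
            if start is not None:
                # The end tag starts at _offset(); the element ends after its '>'.
                end = self.source.index(">", self._offset()) + 1
                self.spans.append((start, end))
```

Calling `DivSpanFinder("some class").find(page)` returns offsets into `page`, so `page[start:end]` is the original text of each matching element. Unclosed tags such as a missing </p> do not disturb the result, since only div tags are tracked, and a `</div>` that appears inside a comment is not counted as a tag.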
