How do I match a whole HTML element the way a browser does, no matter what is inside it?
On a given page there are a bunch of elements:
<div class="some class"> <-- here is anything, other divs, even other divs with
the same class, but I need to match right on closing tag for this particular
opening tag --></div>
Don't use regex to parse HTML. Use DOMDocument instead and save yourself all the headaches.
Related reading here on Stack Overflow:
DOMDocument
Regular expressions describe operations on regular languages. HTML is not a regular language. I'd be prepared to bet you could do it with a so-called "recursive regular expression", as they aren't really regular expressions and aren't limited to regular languages. I'd be prepared to bet more that you'd be better off parsing it instead anyway.
The easiest way (not the best, but the easiest to code in a few lines) is to keep a count of inner divs. Whenever you encounter an opening div tag, up the count. Whenever you encounter a closing div tag, drop the count if it's non-zero; if it's already zero, you've found the end of your complete element. If you reach the end of the file first, somebody hasn't closed their divs properly.
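The counting approach could be sketched roughly like this in Python. The function name `extract_div` and the `DIV_TAG` pattern are made up for illustration, and the tag pattern is a deliberate simplification: it assumes no `<div>` look-alikes inside comments, scripts, or attribute values.

```python
import re

# Matches an opening or closing div tag. Simplified assumption: it does not
# understand comments, <script> bodies, or ">" inside attribute values.
DIV_TAG = re.compile(r"<(/?)div(?=[\s/>])[^>]*>", re.IGNORECASE)


def extract_div(html, start):
    """Return the whole <div>...</div> element whose opening tag begins at `start`."""
    first = DIV_TAG.match(html, start)
    if first is None or first.group(1):
        raise ValueError("no opening <div> tag at the given offset")

    depth = 0  # number of inner divs that are currently open
    for tag in DIV_TAG.finditer(html, first.end()):
        if not tag.group(1):
            depth += 1   # another opening div: one level deeper
        elif depth:
            depth -= 1   # closes an inner div
        else:
            return html[start:tag.end()]  # closes the div we started from

    # End of input reached: somebody hasn't closed their divs properly.
    raise ValueError("reached end of input before the div was closed")
```

Here the regex only finds the individual tags; the nesting is tracked by the counter, which is the part a plain regular expression cannot do.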
Using an XML parser is easier still if you can either depend on the code being well-formed (if you can't, you've got two problems...) or are prepared to just error in the case of non-well-formed input.
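For the well-formed case, a minimal sketch with Python's standard `xml.etree.ElementTree` might look like the following. The helper name `find_divs` is hypothetical, and the sketch simply lets the parser raise on input that is not well-formed.

```python
import xml.etree.ElementTree as ET


def find_divs(markup, class_name):
    """Return every <div> element whose class attribute equals `class_name`."""
    # Raises ET.ParseError if the markup is not well-formed XML.
    root = ET.fromstring(markup)
    # iter() walks the tree in document order and includes nested divs,
    # so each returned element already carries its complete subtree.
    return [div for div in root.iter("div") if div.get("class") == class_name]


markup = (
    '<body><div class="some class">a'
    '<div class="some class">b</div>c</div><div>d</div></body>'
)
for div in find_divs(markup, "some class"):
    print("".join(div.itertext()))
```

Each element the parser hands back is already the complete element, so the matching closing tag never has to be located by hand.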
The only robust solution is to parse the HTML; regexps can't solve this in all cases.
In fact, browsers are often very tolerant; they even cope with errors such as missing tags. So dealing with arbitrary pages is actually quite tricky.
If you are dealing with a page that you produce yourself, then perhaps you can code some special-case regexps. Otherwise you may need to seek out a true parser such as this. (I have never used it myself, but it may well be what you need.)
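As one illustration of the parser route (not the parser linked above), here is a small sketch using Python's standard `html.parser`. The class name `DivCollector` is invented for this example. Unlike a tag-matching regexp it is not fooled by a `</div>` inside a comment, but it does not repair missing closing tags the way a browser would.

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collect the text inside the first <div> that has the given class attribute."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0     # 0 = outside the target div, >0 = nesting level inside it
        self.done = False  # set once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("class") == self.class_name:
            self.depth = 1   # the target div itself

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # matching closing tag reached

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


collector = DivCollector("some class")
collector.feed(
    '<div class="some class">a<!-- </div> -->'
    '<div class="some class">b</div>c</div><div class="some class">d</div>'
)
collector.close()
print("".join(collector.parts))
```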