How can I capture an outer HTML element with a regular expression when elements of the same type are nested inside it?

Posted on 2024-09-14 00:57:49

I'm trying to capture certain parts of HTML using regular expressions, and I've come across a situation which I don't know how to resolve.

I've got an HTML fragment like this:

<span ...> .... <span ...> ... </span> ... </span>

so, a <span> element with another <span> element nested inside it.

I've been successfully using the following regex (in PHP's preg_match() / preg_match_all()) to capture entire HTML elements:

@<sometag[^>]+>.*?</sometag>@

This would capture a given starting tag and everything up to the closing tag of the same type.

However, in the situation above, this would capture the starting <span> and everything up to the next closing </span> encountered, so what I get is this:

<span ...> .... <span ...> ... </span>

that is, the outer starting tag, then everything until the starting tag of the inner span, then everything up to the closing tag of the inner span, which, of course, is not what I want.
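To make the behaviour concrete, here is a minimal reproduction. It uses Python's `re` module as a stand-in for PHP's preg functions (the lazy quantifier behaves the same way in both), and the sample string and its `class` attributes are made up for illustration:

```python
import re

# Hypothetical sample input; the attributes and text are invented for this demo.
html = '<span class="outer"> a <span class="inner"> b </span> c </span>'

# Same shape as the pattern above, with "span" as the tag name.
# re.DOTALL lets "." cross line breaks (the counterpart of PCRE's "s" modifier).
pattern = re.compile(r'<span[^>]+>.*?</span>', re.DOTALL)

match = pattern.search(html)
captured = match.group(0)

# The lazy ".*?" stops at the first "</span>", which belongs to the inner span.
print(captured)  # <span class="outer"> a <span class="inner"> b </span>
```

The match ends at the inner closing tag, so the trailing ` c </span>` of the outer element is left out.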

What I really want is the outer <span> element, complete with everything inside it, including the inner nested <span>.

Is there any practical way to achieve this?

Note: parsing the HTML using an XML parser is probably not an option, as the HTML I'm working on is old and very broken HTML 4 coming out of MS FrontPage that any parser would choke on.
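For what it's worth, one approach that avoids a full parser is to use the regex only to find opening and closing tags, and to count nesting depth by hand. The following is a rough sketch in Python, not a definitive solution; the function name `outer_elements` is hypothetical, and it assumes that no attribute value contains a literal `>` and that every opening tag of interest has a closing tag:

```python
import re

def outer_elements(html, tag):
    """Return the outermost <tag>...</tag> spans found in html (sketch only)."""
    # Matches either an opening or a closing tag with the given name.
    token = re.compile(r'<(/?)%s\b[^>]*>' % re.escape(tag), re.IGNORECASE)
    results = []
    depth = 0
    start = None
    for m in token.finditer(html):
        if m.group(1) == '':
            # Opening tag: remember where the outermost element begins.
            if depth == 0:
                start = m.start()
            depth += 1
        elif depth > 0:
            # Closing tag that pairs with an earlier opening tag.
            depth -= 1
            if depth == 0:
                results.append(html[start:m.end()])
        # A stray closing tag at depth 0 is simply ignored.
    return results

# Hypothetical sample input, as in the fragment above.
sample = '<span class="outer"> a <span class="inner"> b </span> c </span>'
print(outer_elements(sample, 'span'))
```

For the sample string this yields a single item: the whole outer element, inner span included. Unclosed tags in badly broken markup would throw the depth count off, so the result is only as reliable as the tag pairing in the input.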

Thanks for any help!
