Regular expression to match an HTML element

Posted on 2024-09-27 00:22:59

I'm trying to write a regular expression for matching the following HTML.

<span class="hidden_text">Some text here.</span>

I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.

$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";

If anyone could highlight what I'm doing wrong that would be great.

Comments (5)

自由如风 2024-10-04 00:22:59

You need to make the match non-greedy by adding ? after .*:

$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";

Note: If you need to match generic HTML, you should use an XML parser like DOM.
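
To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample string and the variable names are made up for illustration; only the two patterns come from the question and this answer.

```php
<?php
// Hypothetical input: two hidden spans on the same line.
$html = '<span class="hidden_text">first</span> visible '
      . '<span class="hidden_text">second</span>';

// Greedy: .* grabs as much as it can, so the match runs to the LAST </span>.
preg_match('/<span class="hidden_text">(.*)<\/span>/', $html, $greedy);

// Non-greedy: .*? grabs as little as it can, so it stops at the FIRST </span>.
preg_match('/<span class="hidden_text">(.*?)<\/span>/', $html, $lazy);

echo $greedy[1], "\n"; // first</span> visible <span class="hidden_text">second
echo $lazy[1], "\n";   // first
```

With a single span in the input both patterns capture the same text, which is why the problem only shows up "in some cases".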

寂寞陪衬 2024-10-04 00:22:59

You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.

See the following questions for further information on how to do that with PHP:

初见你 2024-10-04 00:22:59
$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";

I got it. ;)

も让我眼熟你 2024-10-04 00:22:59

Chances are that you have multiple spans, and the regexp you're using defaults to greedy mode, so (.*) runs on to the last </span> it can find.

It's a lot easier to use PHP's DOM parser to extract content from HTML.
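
A minimal sketch of the DOM approach, assuming the dom extension is enabled (it usually is). The sample markup and variable names are invented for illustration, and the XPath test matches the class attribute exactly, so it would need adjusting for elements that carry several classes.

```php
<?php
// Hypothetical sample document; only the span/class combination comes from the question.
$html = '<div><span class="hidden_text">Some text here.</span>'
      . '<span class="other">Visible</span>'
      . '<span class="hidden_text">More <b>hidden</b> text.</span></div>';

// Collect libxml warnings internally instead of printing them;
// real-world markup is rarely perfect.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$texts = [];
// Select every <span> whose class attribute is exactly "hidden_text".
foreach ($xpath->query('//span[@class="hidden_text"]') as $span) {
    // textContent returns the text of the element and all of its descendants.
    $texts[] = $span->textContent;
}

print_r($texts);
```

Unlike the regex versions, this also copes with nested tags inside the span, as the second hidden span in the sample shows.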

输什么也不输骨气 2024-10-04 00:22:59

I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:

"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"

...and this one:

'~<span class="hidden_text">[^><]++</span>~'
  • PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var), including more complex variable expressions wrapped in braces ({$arr['key']}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.

  • PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.

  • The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.

  • [^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's no point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.

  • Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character is not <, you don't need a lookahead to tell you the match is going to fail.
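
The points above can be checked with a short sketch. The sample strings and variable names are hypothetical; the two patterns are the ones being compared. It runs both patterns over the same inputs and records whether they ever disagree, and it also confirms that the original pattern survives the move to single quotes unchanged.

```php
<?php
// The pattern from the self-answer, exactly as written (double-quoted).
$original = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
// The same pattern in single quotes: no \" needed, nothing gets interpolated.
$singleQuoted = '/<span class="hidden_text">(?<=^|>)[^><]+?(?=<|$)<\/span>/';
// The simplified pattern: ~ delimiters, no lookarounds, possessive quantifier.
$simplified = '~<span class="hidden_text">[^><]++</span>~';

// Hypothetical inputs: a plain span, two spans, an empty span, a nested tag.
$samples = [
    '<span class="hidden_text">Some text here.</span>',
    'a <span class="hidden_text">one</span> b <span class="hidden_text">two</span>',
    '<span class="hidden_text"></span>',
    '<span class="hidden_text">has <b>tag</b></span>',
];

$agree = true;
foreach ($samples as $s) {
    $r1 = preg_match($original, $s, $m1);
    $r2 = preg_match($simplified, $s, $m2);
    // Compare both the return value and the matched text.
    if ($r1 !== $r2 || $m1 !== $m2) {
        $agree = false;
    }
}

var_dump($original === $singleQuoted); // bool(true)
var_dump($agree);                      // bool(true)
```

Agreement on a handful of samples is not a proof of equivalence, only a sanity check of the reasoning in the list.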

Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
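
As a rough illustration of how things go wrong on perfectly valid HTML, the sketch below runs the simplified pattern against a few equivalent ways of writing the same span. The sample markup and labels are made up for this example.

```php
<?php
$pattern = '~<span class="hidden_text">[^><]++</span>~';

// Hypothetical but valid variations of the span from the question.
$validHtml = [
    'extra attribute' => '<span id="x" class="hidden_text">Some text here.</span>',
    'single quotes'   => "<span class='hidden_text'>Some text here.</span>",
    'second class'    => '<span class="hidden_text note">Some text here.</span>',
    'nested tag'      => '<span class="hidden_text">Some <em>text</em> here.</span>',
];

$missed = [];
foreach ($validHtml as $label => $html) {
    // preg_match returns 0 when the pattern finds nothing.
    if (preg_match($pattern, $html) === 0) {
        $missed[] = $label;
    }
}

print_r($missed); // all four labels: the pattern misses every variation
```

Each variation could be patched into the pattern one at a time, but the list of variations never really ends.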
