Regular expression to match a specific HTML element
I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.
Comments (5)
You need to use a non-greedy selection by adding ? after .* (giving .*?).
Note: If you need to match generic HTML, you should use an XML parser like DOM.
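To see the difference, here is a minimal sketch that runs the pattern from the question and a non-greedy variant against the same input. The sample $html string and the variable names are made up for illustration; only the .* versus .*? change comes from the answer.

```php
<?php
// Hypothetical sample input containing two matching spans.
$html = '<span class="hidden_text">First.</span> and <span class="hidden_text">Second.</span>';

// Greedy: .* grabs as much as it can, so it runs to the LAST </span>.
$greedy = '/<span class="hidden_text">(.*)<\/span>/';
// Non-greedy: .*? grabs as little as it can, so it stops at the FIRST </span>.
$lazy = '/<span class="hidden_text">(.*?)<\/span>/';

preg_match($greedy, $html, $greedyMatch);
preg_match($lazy, $html, $lazyMatch);

echo $greedyMatch[1], "\n"; // First.</span> and <span class="hidden_text">Second.
echo $lazyMatch[1], "\n";   // First.
```

Note that . does not match newlines by default, so a span whose text wraps across lines would additionally need the s modifier.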
You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
I got it. ;)
Chances are that you have multiple spans, and the regexp you're using defaults to greedy mode, so .* runs on to the last </span> in the input.
It's a lot easier to extract content from HTML using PHP's DOM parser.
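As a rough sketch of the DOM approach, assuming PHP's bundled DOM extension is enabled: the function name hiddenTexts and the sample markup are hypothetical, while DOMDocument, DOMXPath and the libxml functions are the standard ones.

```php
<?php
// Hypothetical helper: returns the text of every <span class="hidden_text">.
function hiddenTexts(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them;
    // real-world HTML is rarely perfectly formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // Exact attribute match; a span carrying several classes would need contains().
    foreach ($xpath->query('//span[@class="hidden_text"]') as $span) {
        $texts[] = $span->textContent;
    }
    return $texts;
}

$html = '<p><span class="hidden_text">Some text here.</span> visible <span>other</span></p>';
print_r(hiddenTexts($html));
```

Unlike the regex, this keeps working when the span contains nested tags, because textContent concatenates the text of all descendants.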
I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
...and this one:
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and of variable expressions wrapped in braces ({$my_array['key']}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind, (?<=^|>), was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's no point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character is not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
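Taken together, the points above suggest a pattern along the following lines. This is a reconstruction from the description, not necessarily the exact pattern the answer showed, and the sample $html string is made up for illustration.

```php
<?php
// Single-quoted string: no interpolation, and the double quotes need no escaping.
// ~ as the delimiter: the slash in </span> needs no backslash.
// [^><]++ is possessive: it takes every non-bracket character and never backtracks.
$condition = '~<span class="hidden_text">([^><]++)</span>~';

// Hypothetical sample input containing two matching spans.
$html = '<span class="hidden_text">First.</span> and <span class="hidden_text">Second.</span>';

preg_match_all($condition, $html, $matches);
print_r($matches[1]); // two separate captures: First. and Second.
```

As the answer warns, this is still fragile: a span whose content contains another tag, or whose attributes are written differently, will not match at all.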