PCRE (PHP) regex: a subpattern followed by + matches, but with * it does not?

Posted on 2024-10-14 20:35:21

I want to match and capture all existing (if any) <style ...>...</style> blocks and the contents of the one <body ...>...</body> block in an HTML document.
I thought this was simple, but I ran into something weird.
This was my guess for the whole regex:

/(<style[\s\S]+<\/style>)*[\s\S]*<body.*>([\s\S]+)<\/body>/i

It matches nothing. So I broke it apart, and these parts work:

/(<body.*>([\s\S]+)<\/body>)/i
/(<style[\s\S]+<\/style>)/i

Weirdest of all, the first line below works too, while the second returns an empty result!

/(<style[\s\S]+<\/style>)+/i
/(<style[\s\S]+<\/style>)*/i

So, I guess the error is the difference between * and + after the subpattern. Why? And how do I solve this?

Thanks!!

Comments (1)

月竹挽风 2024-10-21 20:35:21

You've got four problems:

First and second, you're using regular expressions to parse HTML.

Third, you're matching too much: you need to make at least some of the quantifiers lazy, i.e. use .*?, [\s\S]*? etc. Otherwise your regex will match everything up to the end of the line (for .) or of the file (for [\s\S]), and then backtrack only as far as necessary to find the last possible matching tag.
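
The difference between greedy and lazy quantifiers can be seen in a small sketch. The sample input below is made up for illustration; only the two quantifier forms come from the discussion above.

```php
<?php
// Hypothetical sample input: two <style> blocks with other markup in between.
$html = '<style>a{}</style><p>x</p><style>b{}</style>';

// Greedy: [\s\S]+ runs to the end of the input, then backtracks
// only as far as the LAST </style>.
preg_match('/<style[\s\S]+<\/style>/i', $html, $greedy);

// Lazy: [\s\S]+? stops at the FIRST </style> it can reach,
// so preg_match_all finds each block on its own.
preg_match_all('/<style[\s\S]+?<\/style>/i', $html, $lazy);

echo $greedy[0], PHP_EOL;                 // one match spanning the whole input
echo implode(PHP_EOL, $lazy[0]), PHP_EOL; // the two blocks, one per line
```

With the lazy form, preg_match_all returns however many blocks exist, without the pattern having to say how many to expect.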

Fourth, you've set yourself up for catastrophic backtracking by putting repeated subpatterns inside repeated groups, which gives the engine a huge number of ways to match the same text.
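
A starred group has a second, simpler consequence that would likely explain the empty result in the question: * also allows zero repetitions, so the pattern can succeed with an empty match at offset 0 before it ever reaches a <style tag. The sketch below uses a hypothetical helper, describe_match, and made-up sample HTML to tell apart an empty match, no match, and a PCRE error (such as hitting pcre.backtrack_limit).

```php
<?php
// Hypothetical helper (not from the original post): report what preg_match found.
function describe_match(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    if ($result === false) {
        // false means PCRE gave up, e.g. because a backtracking limit was hit.
        return 'error code ' . preg_last_error();
    }
    if ($result === 0) {
        return 'no match';
    }
    // With PREG_OFFSET_CAPTURE, $m[0] is [matched text, byte offset].
    [$text, $offset] = $m[0];
    return $text === ''
        ? "empty match at offset $offset"
        : 'matched ' . strlen($text) . " bytes at offset $offset";
}

// Made-up sample document; note that it does not start with <style.
$html = '<html><head><style>a{}</style></head><body><p>hi</p></body></html>';

echo describe_match('/(<style[\s\S]+<\/style>)+/i', $html), PHP_EOL; // matched 18 bytes at offset 12
echo describe_match('/(<style[\s\S]+<\/style>)*/i', $html), PHP_EOL; // empty match at offset 0
```

Checking for false separately from 0 matters here, because a pattern that runs into a backtracking limit also looks like "nothing came back" unless preg_last_error() is consulted.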

As I understand your question, you want to match everything from the first <style> tag to the final </body> and capture all the <style> tags' contents and the <body> tag's contents. Right? Then try

/(<style[\s\S]+<\/style>)[\s\S]*?<body.*?>([\s\S]+)<\/body>/i
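
As a rough illustration of what this pattern captures, here it is run through preg_match on a made-up sample document (the HTML string is an assumption, not from the question):

```php
<?php
// Made-up sample document for illustration.
$html = '<head><style>a{}</style><style>b{}</style></head>'
      . '<body id="main"><p>hi</p></body>';

$pattern = '/(<style[\s\S]+<\/style>)[\s\S]*?<body.*?>([\s\S]+)<\/body>/i';

if (preg_match($pattern, $html, $m) === 1) {
    echo $m[1], PHP_EOL; // everything from the first <style to the last </style>
    echo $m[2], PHP_EOL; // the inside of <body>
}
```

Because the first group is still greedy, both <style> blocks end up together in group 1 rather than as separate captures.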

To capture each <style> block separately, you could try, for a maximum of four possible <style> blocks:

/(<style[\s\S]+?<\/style>)?\s*(<style[\s\S]+?<\/style>)?\s*(<style[\s\S]+?<\/style>)?\s*(<style[\s\S]+?<\/style>)?\s*<body.*?>([\s\S]+)<\/body>/i

if the <style> blocks are all adjacent and only separated by whitespace. Can you see why it's not a good idea to use regex for this?
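
For comparison, here is a minimal sketch of the same extraction with an HTML parser instead of a regex. It uses PHP's bundled DOM extension (DOMDocument), which is a different technique from the one in the question; the helper name extract_styles_and_body and the sample HTML are hypothetical.

```php
<?php
// Sketch only: parse the document with DOMDocument instead of a regex.
// extract_styles_and_body is a hypothetical helper name.
function extract_styles_and_body(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $styles = [];
    foreach ($doc->getElementsByTagName('style') as $style) {
        $styles[] = $style->textContent; // CSS text of each <style> block
    }

    $bodyHtml = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $bodyHtml .= $doc->saveHTML($child); // serialize each child node
        }
    }

    return ['styles' => $styles, 'body' => $bodyHtml];
}

$result = extract_styles_and_body(
    '<html><head><style>a{}</style><style>b{}</style></head>'
    . '<body><p>hi</p></body></html>'
);
print_r($result);
```

This handles any number of <style> blocks, wherever they appear, without a fixed upper limit.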
