How do I match a string that does not contain a word?

Posted on 2024-11-04 09:18:21

To match a string that contains some word, I can use the pattern "/.*word.*/". But how do I match a string that does not contain this word?

Example:

I need to find a substring in a large text that is enclosed by the two tags <div> and </div> and has some string like "Hello" inside. The best I came up with:

"@<div>(.*?Hello.*?)</div>@i"

But it will also match the sequence:

<div>Bye.</div><div>Hello!</div>

And I do not want to match the first pair of div tags. So I want to replace ".*?" with something like "match any string that does not contain <div>".

Test case:

For input string:

<div>Bye.</div><div>Hello!</div>

I need to catch

<div>Hello!</div>

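To make the problem concrete, here is a minimal sketch that runs the lazy pattern from the question against the test input. The variable names are only illustrative.

```php
<?php
// Illustrative only: shows why a lazy ".*?" still overshoots.
$text = '<div>Bye.</div><div>Hello!</div>';

// The lazy pattern from the question, with ".*?" on both sides of the word.
$pattern = '@<div>(.*?Hello.*?)</div>@i';

preg_match($pattern, $text, $matches);

// The engine reports the leftmost match, and ".*?" is free to run
// across "</div><div>", so the match starts at the first "<div>".
echo $matches[0], "\n"; // <div>Bye.</div><div>Hello!</div>
echo $matches[1], "\n"; // Bye.</div><div>Hello!
```

Laziness only makes the match end as early as possible. It does not stop it from starting too early.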

Comments (3)

夜无邪 2024-11-11 09:18:21


A better title for the question might be: "Match a DIV element containing a specific sub-string." First, it must be said that regex is not the best tool for this job. It would be much better to use an HTML parser to parse the markup, then search the contents of each DIV element for the desired sub-string. That said, since you want to know more about how to use regex to match stuff that is not something else, the following describes a limited way of doing this with a regex.

As Dogbert correctly points out, this question really is a duplicate of "Regular expression to match string not containing a word?". However, I see that you have already looked at that question and need to know how to apply the technique to a sub-pattern.

To match a part of a string (sub-pattern) which does not include a specific word (or words), you need to apply a negative lookahead assertion check before each and every character. Here is how you would do it for the text between opening and closing DIV tags. Note that when using only a single regex, because DIV elements may be nested, it is only reasonable to find "HELLO" within the "innermost" of nested DIV elements.
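As a minimal sketch of the "lookahead before every character" idea on its own, outside of HTML: the pattern below matches a bracketed segment only if its contents never contain the word "skip". The sample text and the word are made up for illustration.

```php
<?php
// Illustrative input: two bracketed segments, one containing "skip".
$text = 'key=[alpha skip beta] other=[gamma delta]';

// (?:(?!skip).)*? consumes one character at a time, and before each
// character asserts that the word "skip" does not start at that position.
$re = '/\[((?:(?!skip).)*?)\]/';

preg_match_all($re, $text, $m);

print_r($m[1]); // only "gamma delta" is captured
```

The first segment is rejected because the group cannot step past the position where "skip" begins, so the closing bracket is never reached from that starting point.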

Pseudo code:

  • Match the opening DIV tag.
  • Lazily match zero or more characters, each of which is not the beginning of <div or </div.
  • Once the desired string: "HELLO" is found, go ahead and match it.
  • Continue (greedily) matching zero or more characters, each of which is not the beginning of <div or </div.
  • Match the closing </div> tag.

Note that to match only the "innermost" DIV contents, it is necessary to exclude both <DIV and </DIV while scanning the element's contents one char at a time. Here is the corresponding regex in the form of a tested PHP function:

// Find an innermost DIV element containing the string "HELLO".
function p1($text) {
    $re = '% # Match innermost DIV element containing "HELLO"
        <div[^>]*>        # DIV element start tag.
        (?:               # Group to match contents up to "HELLO".
          (?!</?div\b)    # Assert this char is not start of DIV tag.
          .               # Safe to match this non-DIV-tag char.
        )*?               # Lazily match contents one char at a time.
        \bhello\b         # Match target "HELLO" word inside DIV.
        (?:               # Group to match content following "HELLO".
          (?!</?div\b)    # Assert this char is not start of DIV tag.
          .               # Safe to match this non-DIV-tag char.
        )*                # Greedily match contents one char at a time.
        </div>            # DIV element end tag.
        %six';
    if (preg_match($re, $text, $matches)) {
        // Match found.
        return $matches[0];
    } else {
        // No match found
        return 'no-match';
    }
}

This function correctly matches the desired DIV element in your test data:

<div>Bye.</div><div>Hello!</div>

It will also correctly find "HELLO" within the innermost of nested DIV elements:

<div>
    <div>
        Hello world!
    </div>
</div>

But, as stated earlier, it will NOT find the "HELLO" string located within non-innermost nested DIV elements like so:

<div>
    Hello,
    <div>
        world!
    </div>
</div>

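The three cases above can be checked with a short script. This is a sketch: find_hello_div() is a hypothetical helper that uses the same pattern as p1(), written on one line and returning null instead of 'no-match'.

```php
<?php
// Hypothetical helper: same pattern as p1(), written compactly.
function find_hello_div(string $text): ?string {
    $re = '%<div[^>]*>(?:(?!</?div\b).)*?\bhello\b(?:(?!</?div\b).)*</div>%si';
    return preg_match($re, $text, $m) ? $m[0] : null;
}

$flat   = '<div>Bye.</div><div>Hello!</div>';
$nested = "<div>\n    <div>\n        Hello world!\n    </div>\n</div>";
$outer  = "<div>\n    Hello,\n    <div>\n        world!\n    </div>\n</div>";

var_dump(find_hello_div($flat));   // the second DIV: <div>Hello!</div>
var_dump(find_hello_div($nested)); // the inner DIV only
var_dump(find_hello_div($outer));  // NULL: "Hello" is not in an innermost DIV
```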
Handling this case requires a much more complicated solution.

There are lots of cases where this solution can fail. Once again, I recommend using an HTML parser.
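For comparison, here is a rough sketch of the parser-based approach using PHP's bundled DOM extension. The function name find_divs_containing() is hypothetical, and it does a plain case-insensitive substring test rather than the whole-word match used in the regex.

```php
<?php
// Sketch: find innermost <div> elements whose text contains a word,
// using DOMDocument and DOMXPath instead of a regex.
function find_divs_containing(string $html, string $word): array {
    $doc = new DOMDocument();
    // Collect libxml warnings internally so imperfect markup stays quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $found = [];
    // "//div[not(.//div)]" selects DIVs that have no DIV descendants.
    foreach ($xpath->query('//div[not(.//div)]') as $div) {
        // Substring test (assumption: whole-word matching is not required).
        if (stripos($div->textContent, $word) !== false) {
            $found[] = $doc->saveHTML($div);
        }
    }
    return $found;
}

print_r(find_divs_containing('<div>Bye.</div><div>Hello!</div>', 'hello'));
```

The XPath is restricted to innermost DIVs only to mirror the regex. Changing it to "//div" would examine every DIV, including outer ones whose text comes from nested elements.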

吾性傲以野 2024-11-11 09:18:21

'~<div>(?!.*?Bye\..*?</div>).+?</div>~'

月光色 2024-11-11 09:18:21

Can't you just check whether you didn't get a match?

If you're looking for anything but the word "word":

if(!preg_match("/word/i", $myString))

This will run code underneath the if only if "word" was not found.
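A small sketch of both options for a whole string: negating preg_match(), and a single pattern that only matches when the word is absent. The helper names are hypothetical.

```php
<?php
// Option 1: search for the word and negate the result.
function lacks_word_negated(string $s, string $word): bool {
    return !preg_match('/' . preg_quote($word, '/') . '/i', $s);
}

// Option 2: one pattern that matches only strings without the word,
// by checking a negative lookahead before every character.
function lacks_word_lookahead(string $s, string $word): bool {
    $re = '/^(?:(?!' . preg_quote($word, '/') . ').)*\z/is';
    return preg_match($re, $s) === 1;
}

var_dump(lacks_word_negated('Bye.', 'word'));         // bool(true)
var_dump(lacks_word_negated('A Word here', 'word'));  // bool(false)
var_dump(lacks_word_lookahead('Bye.', 'word'));       // bool(true)
var_dump(lacks_word_lookahead('A Word here', 'word')); // bool(false)
```

The negated check is simpler when the test applies to the whole string. The lookahead form is the one that can be embedded inside a larger pattern.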
