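Regular expression to match a line that doesn't contain a word
I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?
Input:
hoho
hihi
haha
hede
Code:
grep "<Regex for 'doesn't contain hede'>" input
Desired output:
hoho
hihi
haha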
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s
or use it inline:
/(?s)^((?!hede).)*$/
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
Explanation
A string is just a list of n characters. Before and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
e1 A e2 B e3 h e4 e e5 d e6 e e7 C e8 D e9
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group and repeated zero or more times: ((?!hede).)*. Finally, the start and end of input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
Note that the solution to "does not start with 'hede'":
^(?!hede).*$
is generally much more efficient than the solution to "does not contain 'hede'":
^((?!hede).)*$
The former checks for "hede" only at the input string's first position, rather than at every position.
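The difference between the two patterns discussed above can be tried out with a short Python sketch. It uses the question's sample input plus the "ABhedeCD" string from the explanation; the variable names are illustrative only.

```python
import re

# Tempered lookahead: no position in the line may begin the substring "hede".
NOT_CONTAINS = re.compile(r"^((?!hede).)*$")
# Cheaper variant: only rejects lines that *start* with "hede".
NOT_STARTS_WITH = re.compile(r"^(?!hede).*$")

lines = ["hoho", "hihi", "haha", "hede", "ABhedeCD"]

kept_contains = [s for s in lines if NOT_CONTAINS.match(s)]
kept_starts = [s for s in lines if NOT_STARTS_WITH.match(s)]

print(kept_contains)  # lines without "hede" anywhere
print(kept_starts)    # lines that merely do not begin with "hede"
```

Note that "ABhedeCD" survives the second pattern, which is why the cheaper form is only a substitute when "does not start with" is really what is wanted.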
If you're just using it for grep, you can use grep -v hede to get all lines which do not contain hede.
ETA: Oh, rereading the question, grep -v is probably what you meant by "tools options".
Answer:
^((?!hede).)*$
Explanation:
^ the beginning of the string,
( group and capture to \1 (0 or more times, matching the most amount possible),
(?! look ahead to see if there is not,
hede your string,
) end of look-ahead,
. any character except \n,
)* end of \1 (Note: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
$ before an optional \n, and the end of the string
The given answers are perfectly fine, just an academic point:
Regular expressions in the theoretical computer science sense are not able to do it like this. For them it would have to look something like this:
This only does a full match. Doing it for sub-matches would be even more awkward.
If you want the regex test to fail only if the entire string matches, the following will work:
e.g. if you want to allow all values except "foo" (i.e. "foofoo", "barfoo", and "foobar" will pass, but "foo" will fail), use:
^(?!foo$).*
Of course, if you're checking for exact equality, a better general solution in this case is to check for string equality, i.e.
You could even put the negation outside the test if you need any regex features (here, case insensitivity and range matching):
The regex solution at the top of this answer may be helpful, however, in situations where a positive regex test is required (perhaps by an API).
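As a quick check of the ^(?!foo$).* pattern quoted above, here is a small Python sketch. The helper name allowed and the sample values are illustrative, not part of the answer.

```python
import re

# Rejects only the exact string "foo"; anything longer or different passes.
NOT_EXACTLY_FOO = re.compile(r"^(?!foo$).*")

def allowed(value: str) -> bool:
    """Return True unless the whole value is exactly 'foo'."""
    return NOT_EXACTLY_FOO.match(value) is not None

samples = ["foo", "foofoo", "barfoo", "foobar", ""]
results = {s: allowed(s) for s in samples}
print(results)

# The plain string comparison the answer recommends gives the same verdicts here.
same_as_equality = all(allowed(s) == (s != "foo") for s in samples)
print(same_as_equality)
```

When no API forces a positive regex test, the `s != "foo"` comparison is the simpler choice, as the answer says.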
With negative lookahead, a regular expression can match something that does not contain a specific pattern. This is answered and explained by Bart Kiers. Great explanation!
However, with Bart Kiers' answer, the lookahead part will test 1 to 4 characters ahead while matching any single character. We can avoid this and let the lookahead part check the whole text once, ensure there is no 'hede', and then the normal part (.*) can eat the whole text all at one time.
Here is the improved regex:
^(?!.*?hede).*$
Note that the (*?) lazy quantifier in the negative lookahead part is optional; you can use the (*) greedy quantifier instead, depending on your data: if 'hede' is present and in the first half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier can be faster. However, if 'hede' is not present, both would be equally slow.
For more information about lookahead, please check out the great article: Mastering Lookahead and Lookbehind.
Also, please check out RegexGen.js, a JavaScript regular expression generator that helps to construct complex regular expressions. With RegexGen.js, you can construct the regex in a more readable way:
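To illustrate the single-lookahead idea from this answer, the following Python sketch compares it against the per-character form on a few single-line strings. The pattern spelled ^(?!.*?hede).*$ is reconstructed from the description (lazy quantifier inside the lookahead, then a plain .*), so treat it as an assumption rather than the author's exact text.

```python
import re

# Lookahead evaluated before every single character.
TEMPERED = re.compile(r"^((?!hede).)*$")
# One lookahead over the whole line, then one greedy .* to consume it.
SINGLE = re.compile(r"^(?!.*?hede).*$")

samples = ["hoho", "hihi", "haha", "hede", "xxhede", "hedexx", "hed", ""]

# Both patterns should accept and reject the same single-line inputs.
agree = all(bool(TEMPERED.match(s)) == bool(SINGLE.match(s)) for s in samples)
accepted = [s for s in samples if SINGLE.match(s)]
print(agree, accepted)
```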
FWIW, since regular languages (aka rational languages) are closed under complementation, it's always possible to find a regular expression (aka rational expression) that negates another expression. But not many tools implement this.
Vcsn supports this operator (which it denotes {c}, postfix).
You first define the type of your expressions: labels are letters (lal_char) to pick from a to z, for instance (defining the alphabet when working with complementation is, of course, very important), and the "value" computed for each word is just a Boolean: true if the word is accepted, false if rejected.
In Python:
then you enter your expression:
convert this expression to an automaton:
finally, convert this automaton back to a simple expression.
where + is usually denoted |, \e denotes the empty word, and [^] is usually written . (any character). So, with a bit of rewriting: ()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*
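The rewritten expression can be sanity-checked without Vcsn. The sketch below uses only Python's re module and brute-forces every word up to length 6 over a small test alphabet (the alphabet "hedx" is an arbitrary choice for the check). Read as a full match, the expression appears to accept every word except "hede" itself.

```python
import re
from itertools import product

# The rewritten expression quoted above, used as a full-match pattern.
COMPLEMENT = re.compile(r"()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*")

# Every word of length 0..6 over a small alphabet that contains h, e and d.
ALPHABET = "hedx"
words = ["".join(p) for n in range(7) for p in product(ALPHABET, repeat=n)]

# Collect the words the expression does NOT accept.
rejected = [w for w in words if COMPLEMENT.fullmatch(w) is None]
print(len(words), rejected)
```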
Benchmarks
I decided to evaluate some of the presented options and compare their performance, as well as use some new features.
Benchmarking on the .NET regex engine: http://regexhero.net/tester/
Benchmark text:
The first 7 lines should not match, since they contain the searched expression, while the lower 7 lines should match.
Results:
Results are iterations per second as the median of 3 runs (bigger number = better).
Since .NET doesn't support action verbs ((*FAIL), etc.), I couldn't test the solutions P1 and P2.
Summary:
The most readable and fastest solution overall seems to be 03, with a simple negative lookahead. This is also the fastest solution for JavaScript, since JS does not support the more advanced regex features used by the other solutions.
Since no one else has given a direct answer to the question that was asked, I'll do it.
The answer is that with POSIX grep, it's impossible to literally satisfy this request:
grep "<Regex for 'doesn't contain hede'>" input
The reason is that with no flags, POSIX grep is only required to work with Basic Regular Expressions (BREs), which are simply not powerful enough for accomplishing that task, because of the lack of alternation in subexpressions. The only kind of alternation it supports involves providing multiple regular expressions separated by newlines, and that doesn't cover all regular languages, e.g. there's no finite collection of BREs that matches the same regular language as the extended regular expression (ERE) ^(ab|cd)*$.
However, GNU grep implements extensions that allow it. In particular, \| is the alternation operator in GNU's implementation of BREs. If your regular expression engine supports alternation, parentheses and the Kleene star, and is able to anchor to the beginning and end of the string, that's all you need for this approach. Note however that negative sets [^ ... ] are very convenient in addition to those, because otherwise you need to replace them with an expression of the form (a|b|c| ... ) that lists every character that is not in the set, which is extremely tedious and overly long, even more so if the whole character set is Unicode.
Thanks to formal language theory, we get to see what such an expression looks like. With GNU grep, the answer would be something like:
(found with Grail and some further optimizations made by hand).
You can also use a tool that implements EREs, like egrep, to get rid of the backslashes, or equivalently, pass the -E flag to POSIX grep (although I was under the impression that the question required avoiding any flags to grep whatsoever):
Here's a script to test it (note it generates a file testinput.txt in the current directory). Several of the expressions presented in other answers fail this test.
In my system it prints:
as expected.
For those interested in the details, the technique employed is to convert the regular expression that matches the word into a finite automaton, then invert the automaton by changing every acceptance state to non-acceptance and vice versa, and then convert the resulting FA back to a regular expression.
As everyone has noted, if your regular expression engine supports negative lookahead, the regular expression is much simpler. For example, with GNU grep:
However, this approach has the disadvantage that it requires a backtracking regular expression engine. This makes it unsuitable in installations that are using secure regular expression engines like RE2, which is one reason to prefer the generated approach in some circumstances.
Using Kendall Hopkins' excellent FormalTheory library, written in PHP, which provides functionality similar to Grail, and a simplifier written by myself, I've been able to write an online generator of negative regular expressions given an input phrase (only alphanumeric and space characters are currently supported, and the length is limited): http://www.formauri.es/personal/pgimeno/misc/non-match-regex/
For hede it outputs:
which is equivalent to the above.
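The automaton technique described above (build a finite automaton for "contains the word", swap accepting and non-accepting states) can be sketched in a few lines of Python. Everything here is illustrative: the function names are made up, and the lookahead-free expression GENERATED was derived by hand from this four-state automaton for the sketch, so it is not necessarily the expression the answer or the online generator prints.

```python
import re
from itertools import product

WORD = "hede"

def step(state: int, ch: str) -> int:
    """Advance the 'contains WORD' automaton; state = length of the matched prefix."""
    if state == len(WORD):  # absorbing state: WORD has already been seen
        return state
    text = WORD[:state] + ch
    # Fall back to the longest suffix of the text read so far that is a prefix of WORD.
    for k in range(min(len(text), len(WORD)), -1, -1):
        if text.endswith(WORD[:k]):
            return k
    return 0

def lacks_word(line: str) -> bool:
    """Run the complemented automaton: accept exactly the states the original rejects."""
    state = 0
    for ch in line:
        state = step(state, ch)
    return state != len(WORD)

# Hand-derived ERE for the complemented automaton (an assumption of this sketch).
GENERATED = re.compile(
    r"^([^h]|h(h|eh|edh)*([^eh]|e[^dh]|ed[^eh]))*(|h(h|eh|edh)*(|e|ed))$"
)

# Compare both on every word of length 0..7 over a small test alphabet.
words = ["".join(p) for n in range(8) for p in product("hedx", repeat=n)]
disagreements = [w for w in words if lacks_word(w) != bool(GENERATED.match(w))]
print(len(words), "words checked,", len(disagreements), "disagreements")
```

Because the expression uses only alternation, grouping, the star, anchors and negated sets, it needs no lookahead and so does not depend on a backtracking engine.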
Not regex, but I've found it logical and useful to use serial greps with a pipe to eliminate noise.
E.g. search an Apache config file without all the comments:
and
The logic of serial greps is (not a comment) and (matches dir).
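The same "(not a comment) and (matches dir)" logic can be written as two separate tests in Python. The config lines and both patterns below are hypothetical stand-ins, since the answer's own commands are not shown here.

```python
import re

# Hypothetical lines standing in for an Apache config file.
config_lines = [
    "# DocumentRoot directory",
    "<Directory /var/www>",
    "  # another dir comment",
    "ServerName example.org",
    "DirectoryIndex index.html",
]

COMMENT = re.compile(r"^\s*#")           # first filter: comment lines to drop
DIR = re.compile(r"dir", re.IGNORECASE)  # second filter: lines we want to keep

# Equivalent of piping an inverted match into a second, positive match.
matches = [ln for ln in config_lines if not COMMENT.search(ln) and DIR.search(ln)]
print(matches)
```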
With this, you avoid testing a lookahead at each position:
equivalent to (for .NET):
Old answer:
The aforementioned (?:(?!hede).)* is great because it can be anchored.
But the following would suffice in this case:
This simplification is ready to have "AND" clauses added:
A, in my opinion, more readable variant of the top answer:
Basically, "match at the beginning of the line if and only if it does not have 'hede' in it", so the requirement translates almost directly into regex.
Of course, it's possible to have multiple failure requirements:
Details: The ^ anchor ensures the regex engine doesn't retry the match at every location in the string, which would match every string.
The ^ anchor at the beginning is meant to represent the beginning of the line. The grep tool matches each line one at a time; in contexts where you're working with a multiline string, you can use the "m" flag:
or
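For the multiline case mentioned above, Python's counterpart of the "m" flag is re.MULTILINE. The sketch below assumes a pattern of the form ^(?!.*hede).*$ (anchor, one lookahead over the line, then the line itself); the answer's exact expressions are not shown here, so this spelling is an assumption.

```python
import re

# The question's sample input as a single multi-line string.
text = "hoho\nhihi\nhaha\nhede"

# With re.MULTILINE, ^ and $ anchor at every line, so each line is tested on its own.
PATTERN = re.compile(r"^(?!.*hede).*$", re.MULTILINE)

kept = [m.group(0) for m in PATTERN.finditer(text)]
print(kept)
```

Several failure requirements can be stacked in the same way by placing more than one negative lookahead after the ^ anchor.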
Here's how I'd do it:
Accurate and more efficient than the other answers. It implements Friedl's "unrolling-the-loop" efficiency technique and requires much less backtracking.
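One way an unrolled pattern for this problem might look is sketched below in Python. The expression ^[^h]*(?:h(?!ede)[^h]*)*$ is an assumption made for this sketch (the answer's own pattern is not shown here): it consumes runs of non-h characters directly and only consults a lookahead when it meets an h.

```python
import re
from itertools import product

# Unrolled form: "normal" characters [^h]* in bulk, the "special" case h(?!ede) one at a time.
UNROLLED = re.compile(r"^[^h]*(?:h(?!ede)[^h]*)*$")
# Reference: the per-character lookahead from the top answer.
REFERENCE = re.compile(r"^((?!hede).)*$")

# Compare both on every word of length 0..6 over a small test alphabet.
words = ["".join(p) for n in range(7) for p in product("hedx", repeat=n)]
mismatches = [w for w in words if bool(UNROLLED.match(w)) != bool(REFERENCE.match(w))]
print(len(words), "words checked,", len(mismatches), "mismatches")
```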
Another option is to add a positive look-ahead and check if hede is anywhere in the input line, then negate that, with an expression similar to:
with word boundaries.
The expression is explained on the top right panel of regex101.com, if you wish to explore, simplify, or modify it.
If you want to match a character to negate a word, similar to a negated character class:
For example, a string:
Do not use:
Use:
Notice that "(?!bbb)." does not look behind or further ahead of the match; the lookahead is tested at the current character (call it "lookcurrent"), for example:
The OP did not specify or tag the post to indicate the context (programming language, editor, tool) the regex will be used within.
For me, I sometimes need to do this while editing a file using Textpad.
Textpad supports some regex, but does not support lookahead or lookbehind, so it takes a few steps.
If I am looking to retain all lines that do NOT contain the string hede, I would do it like this:
Now you have the original text with all lines containing the string hede removed.
If I am looking to do something else to only the lines that do NOT contain the string hede, I would do it like this:
Since the introduction of ruby-2.4.1, we can use the new Absent Operator in Ruby's regular expressions.
From the official doc:
Thus, in your case ^(?~hede)$ does the job for you.
Through the PCRE verb (*SKIP)(*F):
^hede$(*SKIP)(*F)|^.*$
This completely skips the lines that consist of exactly the string hede and matches all the remaining lines.
Execution of the parts:
Let us consider the above regex by splitting it into two parts.
The part before the | symbol. This part shouldn't be matched.
^hede$(*SKIP)(*F)
The part after the | symbol. This part should be matched.
^.*$
PART 1
The regex engine starts its execution from the first part.
Explanation:
^ asserts that we are at the start.
hede matches the string hede.
$ asserts that we are at the line end.
So a line that consists of the string hede is matched. Once the regex engine sees the following (*SKIP)(*F) verb (note: you could write (*F) as (*FAIL)), it skips and makes the match fail. The | (alternation, or logical OR, operator) placed next to the PCRE verb then lets the engine try the second alternative on all the lines except the one consisting of exactly hede. That is, it tries to match the characters from the remaining string. Now the regex in the second part is executed.
PART 2
Explanation:
^ asserts that we are at the start, i.e. it matches all the line starts except the one in the hede line.
.* In multiline mode, . matches any character except newline or carriage return characters, and * repeats the previous token zero or more times. So .* matches the whole line.
Why .* instead of .+?
Because .* matches a blank line but .+ doesn't. We want to match all the lines except hede, and there may be blank lines in the input, so you must use .* instead of .+ (.+ repeats the previous token one or more times).
$ The end-of-line anchor is not necessary here.
It may be more maintainable to use two regexes in your code: one to do the first match and then, if it matches, a second regex to check for the outlier cases you wish to block, for example ^.*(hede).*, and then have appropriate logic in your code.
OK, I admit this is not really an answer to the posted question, and it may also use slightly more processing than a single regex. But for developers who came here looking for a fast emergency fix for an outlier case, this solution should not be overlooked.
The TXR Language supports regex negation.
A more complicated example: match all lines that start with a and end with z, but do not contain the substring hede:
Regex negation is not particularly useful on its own, but when you also have intersection, things get interesting, since you have a full set of boolean set operations: you can express "the set which matches this, except for things which match that".
I wanted to add another example for when you are trying to match an entire line that contains string X but does not also contain string Y.
For example, let's say we want to check if our URL / string contains "tasty-treats", so long as it does not also contain "chocolate" anywhere.
This regex pattern would work (works in JavaScript too):
(global, multiline flags in example)
Interactive example: https://regexr.com/53gv4
Matches
(These URLs contain "tasty-treats" and do not contain "chocolate")
Does Not Match
(These URLs contain "chocolate" somewhere, so they won't match even though they contain "tasty-treats")
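A Python sketch of the "contains X but not Y" check follows. Both the pattern and the URLs are assumptions made for illustration (the answer's own pattern and sample URLs are not shown here); the pattern pairs a positive lookahead for "tasty-treats" with a negative lookahead for "chocolate".

```python
import re

# Assumed pattern: require "tasty-treats", forbid "chocolate", then take the line.
PATTERN = re.compile(r"^(?=.*tasty-treats)(?!.*chocolate).*$", re.MULTILINE)

# Hypothetical URLs, one per line.
urls = "\n".join([
    "example.com/tasty-treats/strawberry-ice-cream",
    "example.com/desserts/tasty-treats/banana-pudding",
    "example.com/tasty-treats/chocolate-cake",
    "example.com/home/cooking",
    "example.com/chocolate/tasty-treats/cupcakes",
])

matched = PATTERN.findall(urls)
print(matched)
```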
As long as you are dealing with lines, simply mark the negative matches and target the rest.
In fact, I use this trick with sed because ^((?!hede).)*$ does not appear to be supported by it.
For the desired output
Mark the negative match (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
Target the rest (the unmarked strings, e.g. lines without hede). Suppose you want to keep only the target and delete the rest (as you want):
For a better understanding
Suppose you want to delete the target:
Mark the negative match (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
Target the rest (the unmarked strings, e.g. lines without hede). Suppose you want to delete the target:
Remove the mark:
The function below will help you get your desired output:
^((?!hede).)*$ is an elegant solution, except that since it consumes characters you won't be able to combine it with other criteria. For instance, say you wanted to check for the non-presence of "hede" and the presence of "haha". This solution would work because it won't consume characters:
How to use PCRE's backtracking control verbs to match a line not containing a word
Here's a method that I haven't seen used before:
/.*hede(*COMMIT)^|/
How it works
First, it tries to find "hede" somewhere in the line. If successful, at this point, (*COMMIT) tells the engine not only not to backtrack in the event of a failure, but also not to attempt any further matching in that case. Then, we try to match something that cannot possibly match (in this case, ^).
If a line does not contain "hede", then the second alternative, an empty subpattern, successfully matches the subject string.
This method is no more efficient than a negative lookahead, but I figured I'd just throw it on here in case someone finds it nifty and finds a use for it in other, more interesting applications.
A simpler solution is to use the not operator (!).
Your if statement will need to match "contains" and not match "excludes".
I believe the designers of RegEx anticipated the use of not operators.
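In Python, the "match contains and not match excludes" test described above could look like the sketch below. The function name keep and the two sample patterns are illustrative only.

```python
import re

def keep(line: str, contains: str, excludes: str) -> bool:
    """True when the line matches `contains` and does not match `excludes`."""
    return re.search(contains, line) is not None and re.search(excludes, line) is None

lines = ["hoho", "hihi", "haha", "hede"]

# Keep lines that contain an "h" but do not contain "hede".
kept = [ln for ln in lines if keep(ln, r"h", r"hede")]
print(kept)
```

The negation lives in the host language rather than in the pattern, which keeps both regexes trivial.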
Maybe you'll find this on Google while trying to write a regex that is able to match segments of a line (as opposed to entire lines) which do not contain a substring. Took me a while to figure out, so I'll share:
Given a string:
I want to match <span> tags which do not contain the substring "bad".
/<span(?:(?!bad).)*?>/ will match <span class="good"> and <span class="ugly">.
Notice that there are two sets (layers) of parentheses:
Demo in Ruby:
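A Python counterpart of such a demo is sketched below. The input string is a hypothetical example constructed to contain the three tags the answer mentions; it is not the answer's own string.

```python
import re

# Hypothetical input containing a "good", a "bad" and an "ugly" span.
html = (
    '<span class="good">bar</span>'
    '<span class="bad">foo</span>'
    '<span class="ugly">baz</span>'
)

# Inner group: (?!bad). takes one character that does not start "bad".
# Outer non-capturing group (?:...)*? repeats it lazily up to the closing ">".
TAG = re.compile(r"<span(?:(?!bad).)*?>")

tags = TAG.findall(html)
print(tags)
```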
With ConyEdit, you can use the command line cc.gl !/hede/ to get lines that do not contain the regex match, or use the command line cc.dl /hede/ to delete lines that contain the regex match. They have the same result.