Matching lines that contain a pattern n times
I have a file and I need to filter lines that have (or don't have) N occurrences of a pattern.
I.e., if my pattern is the letter o and I want to match lines where the letter o occurs exactly 4 times, the expression should match the first of the following example lines but not the others:
foo foo
foo
foo foo foo
I thought I could do it with a regex in vim, sed, awk, or any other tool.
I've googled and haven't found anyone who has done a similar thing.
I will probably have to write a script or something similar to parse each line.
Has anyone done a similar thing?
Thanks
Comments (6)
You can use a regex like below:
Regexr - http://regexr.com?2toro
This should work with any pattern you want. For instance, to find lines with exactly four foos, use:
Regexr - http://regexr.com?2tosa
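The linked expressions are not reproduced on this page. As a rough sketch of one way such a regex could be built (an assumption, not necessarily what the answer above used), the Python snippet below wraps a literal pattern in a negative lookahead so that the "filler" between occurrences can never contain the pattern itself. The helper name exactly_n_regex is made up for this illustration.

```python
import re

def exactly_n_regex(pattern: str, n: int) -> re.Pattern:
    """Build a regex matching a whole line that contains `pattern` exactly n times.

    Hypothetical helper for illustration; `pattern` is treated as a literal string.
    """
    p = re.escape(pattern)
    # filler: any run of characters at which `pattern` does not start
    filler = rf"(?:(?!{p}).)*"
    # filler, then n repetitions of (pattern + filler), anchored to the whole line
    return re.compile(rf"^{filler}(?:{p}{filler}){{{n}}}$")

lines = ["foo foo", "foo", "foo foo foo", "foo foo foo foo"]

four_foos = exactly_n_regex("foo", 4)
print([line for line in lines if four_foos.match(line)])

four_os = exactly_n_regex("o", 4)
print([line for line in lines if four_os.match(line)])
```

Lookaheads are available in Python's re module and in Perl-style engines, but not in POSIX regular expressions, so this form would need adapting for plain sed or awk.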
A Perl one-liner:
In awk...
If you're going to be doing this over and over with different patterns and match counts, and the pattern isn't a regular expression, you could also do something like...
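The awk code for this answer is not shown on this page. To illustrate the idea of a reusable filter that takes a fixed-string pattern and a count, here is a minimal Python sketch; it is not a reconstruction of the original awk, and the function name filter_by_count and its invert flag are assumptions made for the example.

```python
def filter_by_count(lines, pattern, n, invert=False):
    """Yield lines in which the fixed string `pattern` occurs exactly n times.

    With invert=True, yield the lines that do NOT have exactly n occurrences.
    str.count counts non-overlapping occurrences, scanning left to right.
    """
    for line in lines:
        has_n = line.count(pattern) == n
        if has_n != invert:
            yield line

sample = ["foo foo", "foo", "foo foo foo"]
print(list(filter_by_count(sample, "o", 4)))               # ['foo foo']
print(list(filter_by_count(sample, "o", 4, invert=True)))  # ['foo', 'foo foo foo']
```

Because the pattern is a plain string rather than a regex, no escaping is needed and the same function can be called with any pattern and count.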
If you want to write code, you can construct a DFA-based string matcher, or have a look at the shift-or string matching algorithm, which is easy to write. You then feed the string into whatever data structure the algorithm needs. Read http://en.wikipedia.org/wiki/Shift_Or_Algorithm for the shift-or string matching algorithm.
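As a rough illustration of the shift-or idea mentioned above, the following Python sketch counts occurrences of a fixed pattern with bit masks and then uses the count to filter lines. The function name shift_or_count is an assumption for this example, and note that this version counts overlapping occurrences, which may differ from what a regex-based approach counts.

```python
def shift_or_count(text: str, pattern: str) -> int:
    """Count occurrences (overlapping ones included) of `pattern` in `text`
    using the Shift-Or (bitap) exact-matching scheme."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    # bit i of state is 0 when pattern[:i+1] matches the text ending here
    state = all_ones
    count = 0
    for ch in text:
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            count += 1  # the whole pattern ends at this character
    return count

sample = ["foo foo", "foo", "foo foo foo"]
print([line for line in sample if shift_or_count(line, "o") == 4])
```

For short patterns on single lines this offers little over a built-in substring count; it mainly shows what writing the matcher by hand would involve.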
It's possible, but not easy.
For the single letter case, an expression such as
^[^o]*o[^o]*o[^o]*o[^o]*o[^o]*$
would work. It basically looks for "not o" (zero or more) followed by "o", four times over, and allows extra "not o" characters at the end.
But longer expressions are a bit of a problem. For example, in order not to find the word "foo", you have to allow "f" and "fo" but not "foo". So to find a line with exactly two occurrences of "foo", you have to allow the line "ffofofoofoffoffoofoffofofo", which is not so easy to define.
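To check the single-letter expression against the example lines from the question, here is a small Python sketch. The compact variant with a {4} repetition is an equivalent spelling added for this illustration, not part of the original answer.

```python
import re

# The expression from the answer, written out in full
four_o = re.compile(r"^[^o]*o[^o]*o[^o]*o[^o]*o[^o]*$")
# Equivalent compact form: leading non-o run, then ("o" + non-o run) four times
four_o_compact = re.compile(r"^[^o]*(?:o[^o]*){4}$")

for line in ["foo foo", "foo", "foo foo foo"]:
    print(line, bool(four_o.match(line)), bool(four_o_compact.match(line)))
```

Only "foo foo" has exactly four o characters, so it is the only line both expressions accept.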
To match "anything but 'foo'" you could use the expression
([^f]|f[^o]|fo[^o])*
which is meant to allow "f" and "fo" and other things, but not "foo". Even this is only an approximation (it still accepts "ffoo", because f[^o] consumes the "ff"), and you can see how this becomes annoying if the word is longer and you have to match it four times.
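To see how fragile a hand-built "anything but foo" expression is, the Python sketch below anchors it to the whole line and compares it with a negative-lookahead form. The lookahead variant is an alternative introduced here, not something the answer proposes, and it relies on a Perl-style regex engine rather than POSIX sed or awk.

```python
import re

# Hand-built alternation from the answer, anchored to the whole line
hand_built = re.compile(r"^([^f]|f[^o]|fo[^o])*$")
# Alternative: consume a character only where "foo" does not start
lookahead = re.compile(r"^(?:(?!foo).)*$")

samples = ["bar", "fob", "foo", "f", "fo", "ffoo"]
for s in samples:
    print(
        f"{s!r}: contains foo={'foo' in s}, "
        f"hand_built={bool(hand_built.match(s))}, "
        f"lookahead={bool(lookahead.match(s))}"
    )
```

The hand-built form rejects lines that end in "f" or "fo" and accepts "ffoo", while the lookahead form agrees with a plain substring test on every sample.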