Is it more efficient to grep twice or to use a regular expression once?
I'm trying to parse a couple of 2 GB+ files and want to grep on a couple of levels.
Say I want to fetch only the lines that contain "foo" and also contain "bar".
I could do grep foo file.log | grep bar, but my concern is that running grep twice will be expensive.
Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?
grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'. You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:
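The two commands themselves are not preserved in this copy of the answer. As a sketch only, assuming the usual sed and awk idioms for requiring two patterns on the same line, they could look like the following. The file name sample.log and its contents are made up for illustration and stand in for file.log.

```shell
# Hypothetical sample data standing in for file.log
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > sample.log

# sed: delete every line without foo, then every remaining line without bar;
# a line deleted by the first expression never reaches the second
sed -e '/foo/!d' -e '/bar/!d' sample.log

# awk: print lines matching both patterns; && short-circuits,
# so /bar/ is only tested on lines that already matched /foo/
awk '/foo/ && /bar/' sample.log
```

Each command prints only the lines that contain both words, in either order.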
Both commands should, in theory, be much more efficient than your cat | grep | grep construct because (1) sed and awk perform their own file reading, so there is no pipe overhead, and (2) the sed and awk programs above use Boolean short-circuiting to quickly skip lines not containing 'foo', so only lines containing 'foo' are tested against the /bar/ regex.
However, I haven't tested them. YMMV :)
In theory, the fastest way should be:
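The command is missing from this copy. Judging by the reasons given for it (a single grep -E reading the file directly), it is presumably the pattern from the question applied straight to the file. A minimal sketch under that assumption, with a made-up sample.log standing in for file.log:

```shell
# Hypothetical sample data standing in for file.log
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > sample.log

# One grep process, reading the file itself (no cat, no second grep);
# the alternation covers both orders of the two words
grep -E '(foo.*bar|bar.*foo)' sample.log
```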
For several reasons: First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E is generally faster than plain grep on large files (but slower on small files), although this will depend on your implementation of grep. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).
These two operations are fundamentally different. This one:
grep foo file.log | grep bar
looks for foo in file.log, then looks for bar in whatever the first grep output. Whereas
cat file.log | grep -E '(foo|bar)'
looks for either foo or bar in file.log. The output should be very different. Use whichever behavior you need.
As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
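To make the difference concrete, here is a small sketch on made-up data (sample.log and its four lines are purely illustrative). The piped form keeps only lines containing both words, while the alternation keeps lines containing either one.

```shell
# Hypothetical sample data: one line per case
printf '%s\n' 'foo only' 'bar only' 'foo and bar' 'neither' > sample.log

# AND: the second grep only sees lines the first one let through
grep foo sample.log | grep bar

# OR: a line matches if it contains foo, bar, or both
grep -E '(foo|bar)' sample.log
```

The first command prints a single line here, and the second prints three.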
If you're doing this, you're only printing lines that contain both foo and bar, in any order. If this is your intention, a single grep will be more efficient, since the input only has to be parsed once.
Notice I don't need the cat, which is more efficient in itself. You rarely ever need cat unless you are concatenating files (which is the purpose of the command). 99% of the time you can either add a file name to the end of the first command in a pipe or, if you have a command like tr that doesn't accept a file argument, redirect the input from the file with <. But enough about useless cats. I have two at home.
You can pass multiple regular expressions to a single grep, which is usually a bit more efficient than piping multiple greps. However, if you can eliminate regular expressions, you might find fgrep the most efficient. Unlike grep, fgrep doesn't parse regular expressions, which means it can parse lines much, much faster. Try both variants and see which is faster.
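Before timing the two variants, it is worth confirming that they select the same lines. A minimal sketch, assuming grep -F (the POSIX spelling of fgrep) and a made-up sample.log; the variable names fixed and regex are only for illustration:

```shell
# Hypothetical sample data standing in for file.log
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > sample.log

# Fixed-string variant: -F treats foo and bar as literal text, not as regexes
fixed=$(grep -F foo sample.log | grep -F bar)

# Single-pass regex variant over the same file
regex=$(grep -E 'foo.*bar|bar.*foo' sample.log)

# A speed comparison is only fair if both select the same lines
[ "$fixed" = "$regex" ] && echo "same lines selected"
```

Once the outputs agree, each command can be prefixed with time (in a shell that provides it) and run against the real file, with output sent to /dev/null so terminal speed does not skew the numbers.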