Is it more efficient to grep twice or to use a regular expression once?

Posted 2024-11-08 03:05:26

I'm trying to parse a couple of 2gb+ files and want to grep on a couple of levels.

Say I want to fetch the lines that contain both "foo" and "bar".

I could do grep foo file.log | grep bar, but my concern is that it will be expensive running it twice.

Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?

Comments (4)

别想她 2024-11-15 03:05:26

grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:

  • Both sed and awk perform their own file reading; no need for pipe overhead
  • The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', thus testing only lines containing 'foo' to the /bar/ regex

However, I haven't tested them. YMMV :)
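
A quick way to check that the sed and awk one-liners select the same lines as the two-grep pipeline is to run all three on a tiny sample. This is only a sketch: the temporary file and its five lines are invented for illustration, and it says nothing about speed.

```shell
# Hypothetical sample data; the file contents are invented for illustration.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > "$sample"

# Two greps in a pipe: the second grep only sees lines that matched foo.
out_pipe=$(grep foo "$sample" | grep bar)

# sed: delete lines lacking foo, then delete lines lacking bar.
out_sed=$(sed '/foo/!d;/bar/!d' "$sample")

# awk: print lines that match both patterns.
out_awk=$(awk '/foo/ && /bar/' "$sample")

if [ "$out_pipe" = "$out_sed" ] && [ "$out_sed" = "$out_awk" ]; then
  echo "all three agree"
else
  echo "outputs differ"
fi
```

On this sample all three should keep only the "foo then bar" and "bar then foo" lines.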

女中豪杰 2024-11-15 03:05:26

In theory, the fastest way should be:

grep -E '(foo.*bar|bar.*foo)' file.log

For several reasons: First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E is generally faster than plain grep on large files (but slower on small files), although this will depend on your implementation of grep. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).
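
One caveat worth checking before relying on the single-regex form: it matches the same lines as the pipeline only when the two patterns cannot overlap. "foo" and "bar" share no characters, so they are fine. The sketch below uses two made-up patterns, "ab" and "bc", to show where the forms diverge; the sample file is hypothetical.

```shell
# Hypothetical sample; 'ab' and 'bc' are chosen because they can overlap in 'abc'.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' 'abc' 'ab then bc' 'bc then ab' > "$sample"

# Pipeline: every line containing both substrings, overlapping or not.
pipe_count=$(grep ab "$sample" | grep -c bc)

# Single regex: needs one pattern to end before the other begins, so 'abc' is missed.
regex_count=$(grep -cE '(ab.*bc|bc.*ab)' "$sample")

echo "pipeline: $pipe_count, single regex: $regex_count"
```

Here the pipeline counts 3 lines and the single regex counts 2, because "abc" contains both substrings but only as an overlap.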

月亮是我掰弯的 2024-11-15 03:05:26

These two operations are fundamentally different. This one:

cat file.log | grep foo | grep bar

looks for foo in file.log, then looks for bar in whatever the last grep output. Whereas cat file.log | grep -E '(foo|bar)' looks for either foo or bar in file.log. The output should be very different. Use whatever behavior you need.

As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
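
To see the difference in output concretely, here is a minimal sketch on a three-line sample (the file and its contents are made up for illustration). The alternation counts every line with either word, while the pipeline counts only lines with both.

```shell
# Hypothetical sample data for illustration only.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' 'foo only' 'bar only' 'foo and bar' > "$sample"

# OR: lines containing foo or bar (or both).
either=$(grep -cE '(foo|bar)' "$sample")

# AND: lines containing foo that also contain bar.
both=$(grep foo "$sample" | grep -c bar)

echo "either: $either, both: $both"
```

On this sample the OR form counts 3 lines and the AND form counts 1.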

傲娇萝莉攻 2024-11-15 03:05:26

If you're doing this:

cat file.log | grep foo | grep bar

You're only printing lines that contain both foo and bar in any order. If this is your intention:

grep -e "foo.*bar" -e "bar.*foo" file.log

will be more efficient, since each line only has to be scanned once by a single process.

Notice I don't need the cat, which is more efficient in itself. You rarely need cat unless you are concatenating files (which is the purpose of the command). 99% of the time you can either add a file name to the end of the first command in a pipe or, if you have a command like tr that doesn't accept a file argument, redirect the input like this:

tr 'a-z' 'A-Z' < "$fileName"

But, enough about useless cats. I have two at home.

You can pass multiple regular expressions to a single grep which is usually a bit more efficient than piping multiple greps. However, if you can eliminate regular expressions, you might find this the most efficient:

fgrep "foo" file.log | fgrep "bar"

Unlike grep, fgrep doesn't parse regular expressions which means it can parse lines much, much faster. Try this:

time fgrep "foo" file.log | fgrep "bar"

and

time grep -e "foo.*bar" -e "bar.*foo" file.log

And see which is faster.
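
If you don't have the real logs at hand, the comparison can be sketched on a generated file. Everything below is an assumption for illustration: the synthetic log, its 200000 lines, and the helper names two_fgreps, one_regex, and bench. It uses grep -F, the standard spelling of fgrep, and date +%s, which only has one-second resolution, so on a sample this small both runs will probably report 0 s. Point log at a multi-gigabyte file, or use your shell's time keyword as above, for meaningful numbers.

```shell
# Synthetic log (hypothetical): every 3rd line gets "foo", every 5th gets "bar".
log=$(mktemp)
trap 'rm -f "$log"' EXIT
awk 'BEGIN {
  for (i = 0; i < 200000; i++) {
    s = "line " i
    if (i % 3 == 0) s = "foo " s
    if (i % 5 == 0) s = s " bar"
    print s
  }
}' > "$log"

# The two candidates from the answer, wrapped as functions (names are made up).
two_fgreps() { grep -F foo "$log" | grep -F bar; }
one_regex()  { grep -e 'foo.*bar' -e 'bar.*foo' "$log"; }

# Coarse timer: runs the named function and prints whole seconds elapsed.
bench() {
  start=$(date +%s)
  "$1" > /dev/null
  end=$(date +%s)
  echo "$1: $((end - start)) s"
}

bench two_fgreps
bench one_regex
```

Both candidates should select the same lines here, so any difference you measure is purely about speed.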
