Performance issues parsing a large (~5 GB) log file with awk, grep, and sed
I am currently dealing with log files approximately 5 GB in size. I'm quite new to parsing log files and using UNIX bash, so I'll try to be as precise as possible. While searching through log files, I do the following: I provide the request number to look for, and then optionally provide the action as a secondary filter. A typical command looks like this:
fgrep '2064351200' example.log | fgrep 'action: example'
This is fine for smaller files, but with a 5 GB log file it's unbearably slow. I've read online that using sed or awk (or possibly even a combination of both) is a good way to improve performance, but I'm not sure how this is accomplished. For example, using awk, I have a typical command:
awk '/2064351200/ {print}' example.log
Basically, my ultimate goal is to be able to print/return the records (or line numbers) that contain the strings to match (there could be up to 4-5 of them, and I've read that piping is bad) from a log file efficiently.
On a side note, in bash shell, if I want to use awk and do some processing, how is that achieved? For example:
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3 }
END { print " - DONE -" }
That is a pretty simple awk script, and I would assume there's a way to put this into a one-liner bash command? But I'm not sure what the structure is.
Thanks in advance for the help. Cheers.
4 Answers
You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform. Try some tests like this:
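For instance, time each tool on the same single-string search, redirecting output to /dev/null so terminal printing doesn't dominate the measurement:

time fgrep '2064351200' example.log > /dev/null
time egrep '2064351200' example.log > /dev/null
time sed -n '/2064351200/p' example.log > /dev/null
time awk '/2064351200/' example.log > /dev/null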
Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that. Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching. But it really depends on the implementation, and the correct way to find out is to run tests. Run them each several times so you don't get messed up by things like caching and competing processes.
As @Ron pointed out, your search process may be disk I/O bound. If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but then requires more CPU time to process because it has to be decompressed first. Try something like this:
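A rough sketch (the compression step is a one-time cost; zgrep and bzgrep decompress on the fly for each search):

# one-time preprocessing:
time gzip -c example.log > example.log.gz
time bzip2 -c example.log > example.log.bz2

# per-search, compared against the uncompressed baseline:
time fgrep '2064351200' example.log > /dev/null
time zgrep '2064351200' example.log.gz > /dev/null
time bzgrep '2064351200' example.log.bz2 > /dev/null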
I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall. Your computer will have different disk and CPU performance than mine, so your results may be different. If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.
Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in? If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing. As with compression, you spend some extra time up front, but then each individual search runs faster.
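For example, if every search you run shares the same coarse filter (the request number here is just an illustration):

# one-time: keep only the candidate lines, optionally compressed
fgrep '2064351200' example.log | gzip -c > request-2064351200.log.gz

# later searches only touch the much smaller file
zgrep 'action: example' request-2064351200.log.gz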
A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work. But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it. Also, consider whether you're actually passing all of the data through the entire pipe. In the example you gave,
fgrep '2064351200' example.log | fgrep 'action: example'
the first fgrep will discard most of the file; the pipe and the second command only have to process the small fraction of the log that contains '2064351200', so the slowdown will likely be negligible.

tl;dr: TEST ALL THE THINGS!
EDIT: if the log file is "live" (i.e. new entries are being added) but the bulk of it is static, you may be able to use a partial preprocessing approach: compress (and maybe prescan) the log, then when scanning use the compressed (and/or prescanned) version plus a tail of the part of the log added since you did the prescan. Something like this:
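One possible sketch (assuming the log is append-only and the preprocessing step is re-run periodically):

# periodic preprocessing: compress what exists now and record how many bytes it covered
gzip -c example.log > example.log.gz
wc -c < example.log > example.log.offset

# at search time: scan the compressed bulk plus whatever has been appended since
( gzip -dc example.log.gz
  tail -c +$(( $(cat example.log.offset) + 1 )) example.log ) | fgrep '2064351200'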
If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:
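For example, building on the compressed-plus-tail idea above (the file names are arbitrary):

# prescan once for the request number and save the result
( gzip -dc example.log.gz
  tail -c +$(( $(cat example.log.offset) + 1 )) example.log ) |
  fgrep '2064351200' > req-2064351200.log

# each follow-up search only touches the small prescanned file
fgrep 'action: example' req-2064351200.log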
If you don't know the sequence of your strings, then:
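For example, with awk, requiring every pattern to match somewhere in the same line:

awk '/2064351200/ && /action: example/' example.log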
If you know that they will appear one following another in the line:
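For example, with a single regular expression covering both strings in order:

grep '2064351200.*action: example' example.log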
(Note that for awk, {print} is the default action block, so it can be omitted when a condition is given.)

Dealing with files that large is going to be slow no matter how you slice it.
As to multi-line programs on the command line,
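you can pass the whole script as one quoted argument, joining the blocks on a single line, for example:

awk 'BEGIN { print "File\tOwner" } { print $8, "\t", $3 } END { print " - DONE -" }' example.log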
Note the single quotes.
If you process the same file multiple times, it might be faster to read it into a database, and perhaps even create an index.
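A rough sketch with SQLite, assuming a whitespace-delimited log with the request number in a known field (field 2 here, purely as an illustration) and no literal tab characters in the lines:

# build a tab-separated (request, full line) file; adjust $2 to wherever the request number lives
awk '{ print $2 "\t" $0 }' example.log > example.tsv

# one-time: load it into SQLite and index the request column
sqlite3 logs.db <<'SQL'
CREATE TABLE log (request TEXT, line TEXT);
.mode tabs
.import example.tsv log
CREATE INDEX idx_log_request ON log(request);
SQL

# subsequent lookups become index scans instead of full-file greps
sqlite3 logs.db "SELECT line FROM log WHERE request = '2064351200';"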