How do I determine the slowest component in a shell pipeline?
I have an extremely long and complicated shell pipeline set up to grab 2.2 GB of data and process it. It currently takes 45 minutes to run. The pipeline is a number of cut, grep, sort, uniq, grep and awk commands tied together. I suspect the grep portion is what takes so much time, but I have no way of confirming it.
Is there any way to "profile" the entire pipeline from end to end, to determine which component is the slowest and whether it is CPU or IO bound, so it can be optimised?
Unfortunately I cannot post the entire command here, as that would mean posting proprietary information, but from watching it in htop I suspect it is the following bit:
grep -v ^[0-9]
One way to do this is to build up the pipeline gradually, timing each addition and taking as much out of the equation as possible (such as output to a terminal or file). A very simple example is a pipeline of cat, tr, tail and sort, timed once after each stage is added.
If you add up the user and system times for each run, you'll see that the incremental increases are: 0.304 (0.004 + 0.300) seconds for cat; 0.436 (0.312 + 0.428 - 0.304) seconds for tr; 0.464 (0.516 + 0.688 - 0.436 - 0.304) seconds for tail; and 0.108 (0.556 + 0.756 - 0.464 - 0.436 - 0.304) seconds for sort.
This tells me that the main things to look into are the tail and the tr.
Now obviously, that's for CPU only, and I probably should have done multiple runs at each stage for averaging purposes, but that's the basic first approach I would take.
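The commands behind those timings did not survive in this copy, so here is only a minimal sketch of the incremental approach, not the original example. It assumes bash is installed (its `time` keyword can time a whole pipeline). The helper name `time_stage`, the generated input file, and the arguments to tr, tail and sort are all hypothetical stand-ins.

```shell
# Stand-in data: a generated file playing the role of the real input.
input=$(mktemp)
seq 1 200000 | sed 's/^/line /' > "$input"

# Time one pipeline prefix with bash's `time` keyword. Output is discarded
# so that terminal or disk writes do not distort the numbers.
time_stage() {
    printf '%s\n' "== $1" >&2
    bash -c "TIMEFORMAT='%U user  %S sys  %R real'; time ( $1 ) >/dev/null"
}

# Add one stage per run; the jump in user + sys between consecutive runs
# is a rough estimate of what the newly added stage costs.
time_stage "cat '$input'"
time_stage "cat '$input' | tr 'a' 'b'"
time_stage "cat '$input' | tr 'a' 'b' | tail -n 1000"
time_stage "cat '$input' | tr 'a' 'b' | tail -n 1000 | sort -u"

rm -f "$input"
```

The real (elapsed) figure is printed as well. If it is much larger than user plus sys for a run, the pipeline is probably spending its time waiting, for example on IO, rather than computing.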
If it turns out it really is your grep, there are a few other options available to you. There are numerous other commands that can drop lines starting with a digit (which is what grep -v ^[0-9] does), but you may find that a custom-built command for doing this is faster still.
Custom, targeted code like this can sometimes be made more efficient than a general-purpose regex processing engine, simply because it can be optimised for the specific case. Whether that's true in this case, or any case for that matter, is something you should test. My number one optimisation mantra is: measure, don't guess!
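The pseudo-code for the custom command is missing from this copy, and the sketch below does not try to recreate it. It only shows a few stock commands that should behave the same as the grep filter, with a check that they agree on some stand-in data before any of them is swapped into the pipeline and timed. The function names and the sample data are hypothetical. The pattern is quoted here so the shell cannot expand it as a filename glob.

```shell
# Stand-in data: half the lines start with a digit, half do not.
sample=$(mktemp)
seq 1 50000 | awk '{ print $1 " starts with a digit"; print "x" $1 " does not" }' > "$sample"

# Three candidate filters that drop lines beginning with a digit.
filter_grep() { grep -v '^[0-9]'; }
filter_sed()  { sed '/^[0-9]/d'; }
filter_awk()  { awk '!/^[0-9]/'; }

# Print a checksum of each filter's output; matching checksums mean the
# candidates produced identical results on this sample.
for f in filter_grep filter_sed filter_awk; do
    printf '%s: ' "$f"
    "$f" < "$sample" | cksum
done

rm -f "$sample"
```

Once the outputs match, each candidate can be dropped into the real pipeline in turn and timed, in keeping with "measure, don't guess".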
I found the problem myself after some further experimentation. It appears to be due to the encoding support in grep. The grep -v ^[0-9] stage from the question hung the pipeline.
I replaced it with sed and it finished in under 45 seconds!
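The exact commands from this answer are missing from this copy. If the slowdown really did come from multibyte locale handling (an assumption, though a commonly reported cause with older grep builds in UTF-8 locales), one thing worth testing before abandoning grep is forcing the C locale for that one stage. The function name `strip_digit_lines` is hypothetical.

```shell
# Show which character encoding the current environment asks tools to use.
locale charmap 2>/dev/null || echo "locale command not available" >&2

# The same filter, but with byte-wise matching forced via the C locale.
# Only this one command sees LC_ALL=C; the rest of the pipeline is unaffected.
strip_digit_lines() { LC_ALL=C grep -v '^[0-9]'; }

# Tiny demonstration: only the line that does not start with a digit survives.
printf '1 drop\nkeep\n2 drop\n' | strip_digit_lines   # prints "keep"
```

In the C locale [0-9] matches only the ASCII digits, which is usually what is wanted for this kind of data, but that too is worth checking against real input.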
This is straightforward with zsh:
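The example that followed is missing from this copy. As a hedged sketch of what it most likely relied on: zsh's `time` reserved word reports timing for each command in a pipeline separately, on stderr, in the format set by zsh's TIMEFMT parameter. The wrapper `zsh_time_pipeline` and the sleep pipeline are hypothetical stand-ins, and the block assumes zsh is installed.

```shell
# Run a pipeline under zsh's `time` reserved word, which prints one timing
# line per pipeline stage on stderr.
zsh_time_pipeline() {
    if command -v zsh >/dev/null 2>&1; then
        zsh -c "time $1"
    else
        echo "zsh is not installed" >&2
        return 127
    fi
}

# Stand-in pipeline; the differing sleeps make the stages easy to tell apart.
zsh_time_pipeline "sleep 1 | sleep 2 | sleep 1" || true
```

Each reported line carries user, system and total (elapsed) time plus a CPU percentage, so a stage with a low CPU percentage and a long total is more likely waiting on its input or on IO than computing.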