How do you determine which component of a shell pipeline is the slowest?

Posted on 2024-12-09 01:42:21

I have an extremely long and complicated shell pipeline set up to grab 2.2 GB of data and process it. It currently takes 45 minutes to process. The pipeline is a number of cut, grep, sort, uniq, grep and awk commands tied together. I suspect it's the grep portion that is causing it to take so much time, but I have no way of confirming that.

Is there any way to "profile" the entire pipeline from end to end to determine which component is the slowest, and whether it is CPU- or IO-bound, so it can be optimised?

Unfortunately I cannot post the entire command here, as it would require posting proprietary information, but from checking it out with htop I suspect it is the following bit:

grep -v ^[0-9]

Comments (3)

自控 2024-12-16 01:42:21

One way to do this is to gradually build up the pipeline, timing each addition, and taking as much out of the equation as possible (such as outputting to a terminal or file). A very simple example is shown below:

pax:~$ time ( cat bigfile >/dev/null )
real 0m4.364s
user 0m0.004s
sys  0m0.300s

pax:~$ time ( cat bigfile | tr 'a' 'b' >/dev/null )
real 0m0.446s
user 0m0.312s
sys  0m0.428s

pax:~$ time ( cat bigfile | tr 'a' 'b' | tail -1000l >/dev/null )
real 0m0.796s
user 0m0.516s
sys  0m0.688s

pax:~$ time ( cat bigfile | tr 'a' 'b' | tail -1000l | sort -u >/dev/null )
real 0m0.892s
user 0m0.556s
sys  0m0.756s

If you add up the user and system times above, you'll see that the incremental increases are:

  • 0.304 (0.004 + 0.300) seconds for the cat;
  • 0.436 (0.312 + 0.428 - 0.304) seconds for the tr;
  • 0.464 (0.516 + 0.688 - 0.436 - 0.304) seconds for the tail; and
  • 0.108 (0.556 + 0.756 - 0.464 - 0.436 - 0.304) seconds for the sort.

This tells me that the main things to look into are the tail and the tr.

Now obviously, that's for CPU only, and I probably should have done multiple runs at each stage for averaging purposes, but that's the basic first approach I would take.
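The build-it-up-and-time-it approach above can also be scripted, so each prefix of the pipeline gets timed in turn. A rough sketch, where `bigfile` and the three stages are placeholders to be replaced with your own pipeline:

```shell
# Time successively longer prefixes of a pipeline.
# 'bigfile' and the stage commands below are placeholders -- substitute
# the stages of your own pipeline.
set -- "cat bigfile" "tr 'a' 'b'" "sort -u"

cmd=""
for stage in "$@"; do
    if [ -z "$cmd" ]; then
        cmd="$stage"
    else
        cmd="$cmd | $stage"
    fi
    echo "== $cmd"
    # Discard output so writing to a terminal or file doesn't skew the numbers.
    time sh -c "$cmd" >/dev/null
done
```

Each iteration re-runs all of the earlier stages, so, as with the manual runs above, you compare the increments between successive timings rather than the absolute numbers.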

If it turns out it really is your grep, there are a few other options available to you. Numerous other commands can strip lines starting with a digit (which is what `grep -v ^[0-9]` does), but you may find that a custom-built command for the job is faster still; in pseudo-code (untested, but you should get the idea):

state = echo
lastchar = newline
while not end of file:
    read big chunk from file
    for every char in chunk:
        if lastchar is newline:
            if state is echo and char is digit:
                state = skip
            else if state is skip and char is non-digit:
                state = echo
        if state is echo:
            output char
        lastchar = char

Custom, targeted code like this can sometimes be made more efficient than a general-purpose regex processing engine, simply because it can be optimised to the specific case. Whether that's true in this case, or any case for that matter, is something you should test. My number one optimisation mantra is measure, don't guess!
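Before reaching for custom code, it's worth noting that tools already in the pipeline can express the same filter. For instance, an awk equivalent of the questioner's `grep -v '^[0-9]'` (delete lines that start with a digit; `input.txt` is a placeholder file name) would be:

```shell
# awk equivalent of grep -v '^[0-9]': print only the lines that do NOT
# start with a digit. ('input.txt' is a placeholder.)
awk '!/^[0-9]/' input.txt
```

Whether this beats grep on your data is exactly the kind of thing to measure rather than guess.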

橘亓 2024-12-16 01:42:21

I found the problem myself after some further experimentation. It appears to be due to the encoding support in grep. Using the following hung the pipeline:

grep -v ^[0-9]

I replaced it with sed as follows and it finished in under 45 seconds!

sed '/^[0-9]/d'
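A related workaround, given that the slowdown is attributed to grep's encoding support, is to keep the grep but force the C locale, which makes it do plain byte-wise matching instead of locale-dependent multibyte processing. This is an assumption worth testing on your own data rather than a guarantee (quoting the pattern also protects `^[0-9]` from accidental shell globbing):

```shell
# Force the C locale so grep matches bytes rather than multibyte
# characters; this often restores grep's speed under UTF-8 locales.
# ('input.txt' is a placeholder.)
LC_ALL=C grep -v '^[0-9]' input.txt
```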
撕心裂肺的伤痛 2024-12-16 01:42:21

This is straightforward with zsh:

zsh-4.3.12[sysadmin]% time sleep 3 | sleep 5 | sleep 2
sleep 3  0.01s user 0.03s system 1% cpu 3.182 total
sleep 5  0.01s user 0.01s system 0% cpu 5.105 total
sleep 2  0.00s user 0.05s system 2% cpu 2.121 total
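For comparison, in bash the `time` keyword reports only one combined figure for the whole pipeline, which is why zsh's per-command breakdown above is handy:

```shell
# In bash, `time` on a pipeline gives a single total -- roughly the
# runtime of the longest stage, since the stages run concurrently.
time ( sleep 1 | sleep 2 )
```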