Strange problem with cut, colrm, awk and sed: unable to cut characters from a piped stream

Posted 2024-10-14 10:03:58

I have created a script to enumerate all files in a directory and below it. I wanted to add some progress feedback by using pv, because I usually run it from the root directory.

The problem is that find always includes fractional seconds in its time output (%TT), and I don't want to record that much detail.

If I write the script to do everything in one pass, I get the right output. But if I use intermediate files so that a "second" pass can show an estimate, the result changes and I do not see why.

This version gives the right result:

#!/bin/bash

find -printf "%11s %TY-%Tm-%Td %TT %p\n" 2> /dev/null |
# - Remove the fractional seconds from the time
# before:       4096 2011-01-19 22:43:51.0000000000 .
# after :       4096 2011-01-19 22:43:51 .
colrm 32 42 |
pv -ltrbN "Enumerating files..." |
# - Sort everything by filename
sort -k 4
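
To see why columns 32 to 42 are the right ones to drop, here is a small sketch that rebuilds one line in the same layout as the -printf format above and removes the same columns with cut instead of colrm. The sample values come from the comment in the script; the variable names are only illustrative.

```shell
#!/bin/bash
# Build one sample line with the same field layout as
# find -printf "%11s %TY-%Tm-%Td %TT %p\n"
sample=$(printf '%11s %s %s %s' 4096 2011-01-19 22:43:51.0000000000 .)

# Columns  1-11: size, right-aligned      column 12: space
# Columns 13-22: date                     column 23: space
# Columns 24-31: HH:MM:SS
# Columns 32-42: "." plus ten digits of fractional seconds
# Columns 43-  : space and path
trimmed=$(printf '%s\n' "$sample" | cut -c1-31,43-)

printf '%s\n' "$sample"
printf '%s\n' "$trimmed"
```

Note that the column arithmetic only holds while the size fits in 11 characters; a wider size would shift every later column.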

But the sort can take a long time, so I tried something like this to get a little more feedback:

#!/bin/bash

TMPFILE1=$(mktemp)
TMPFILE2=$(mktemp)

# Erase temporary files before quitting
trap "rm $TMPFILE1 $TMPFILE2" EXIT

find -printf "%11s %TY-%Tm-%Td %TT %p\n" 2> /dev/null |
pv -ltrbN "Enumerating files..." > $TMPFILE1
LINE_COUNT="$(wc -l $TMPFILE1)"

#cat $TMPFILE1 | colrm 32 42 |                   #1
#cat $TMPFILE1 | cut -c1-31,43- |                #2
#cut -c1-31,43- $TMPFILE1 |                      #3
#sed s/.0000000000// $TMPFILE1 |                 #4
awk -F".0000000000" '{print $1 $2}' $TMPFILE1 |  #5
pv -lN "Removing fractional seconds..." -s $LINE_COUNT > $TMPFILE2

echo "Sorting list by filenames..." >&2
cat $TMPFILE2 |
sort -k 4

None of the 5 "solutions" works. The ".0000000000" part is left in the output.

Can someone explain why?

My final solution is to combine the cutting operation with the find and use only one temporary file. Only the sort is done separately.
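
The question does not show that final script, so the following is only a sketch of what a single-temp-file version might look like. It assumes GNU find (for -printf), uses cut in place of colrm, and skips pv when it is not installed. The function names list_files and enumerate_sorted are made up for illustration.

```shell
#!/bin/bash
# list_files DIR: print "size date time path" lines, with the fractional
# seconds (columns 32-42) already cut off. Assumes GNU find.
list_files() {
    find "$1" -printf "%11s %TY-%Tm-%Td %TT %p\n" 2> /dev/null |
    cut -c1-31,43-
}

# enumerate_sorted DIR: pass 1 writes the trimmed list to one temporary
# file (with pv feedback if pv is available), pass 2 sorts it by filename.
enumerate_sorted() {
    local dir=$1 tmpfile
    tmpfile=$(mktemp) || return 1

    if command -v pv > /dev/null; then
        list_files "$dir" | pv -ltrbN "Enumerating files..." > "$tmpfile"
    else
        list_files "$dir" > "$tmpfile"
    fi

    # Count only, without the file name that "wc -l FILE" would append
    echo "Sorting $(wc -l < "$tmpfile") lines by filename..." >&2
    sort -k 4 "$tmpfile"
    rm -f "$tmpfile"
}
```

It could be called as enumerate_sorted / > filelist.txt, for instance; the progress messages go to stderr, so they do not end up in the list.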

Comments (2)

峩卟喜欢 2024-10-21 10:03:59

You can truncate the seconds within the argument to -printf using a field precision specifier (at least using GNU find 4.4.2):

find -printf "%11s %TY-%Tm-%Td %.8TT %p\n"

which leaves the eight characters in "HH:MM:SS".

The rest of my answer is possibly moot:

The reason your #1-5 don't work is that the output of wc includes the filename, separated from the count by a space. Because $LINE_COUNT is expanded unquoted after -s, that space splits it into two words, so pv sees the filename from the wc command as an input file. A file named on the command line takes precedence over stdin. Since it happens to be the same file that was fed into the pipeline, the output file looks like an unprocessed copy of the input (because it is one: the data coming through the pipe is ignored).
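
The effect can be reproduced without pv. In this sketch a small hypothetical function, passthrough, stands in for pv: it skips the -s option and its value, then behaves like cat, which also reads named files in preference to stdin. It assumes the path returned by mktemp contains no spaces.

```shell
#!/bin/bash
tmpfile=$(mktemp)
printf 'raw line\n' > "$tmpfile"

# Hypothetical stand-in for pv: drop "-s" and its value, then copy either
# the named files or, if there are none, stdin.
passthrough() { shift 2; cat "$@"; }

# Same mistake as in the script: wc prints the count AND the file name
line_count="$(wc -l "$tmpfile")"
broken=$(printf 'processed line\n' | passthrough -s $line_count)

# Count only: nothing but the number follows -s, so stdin is read
line_count=$(wc -l < "$tmpfile")
fixed=$(printf 'processed line\n' | passthrough -s $line_count)

printf 'broken: %s\nfixed:  %s\n' "$broken" "$fixed"
rm -f "$tmpfile"
```

In the broken case the piped "processed line" never reaches the output; the raw file content does.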

To capture only the count without the filename:

LINE_COUNT=$(wc -l < "$TMPFILE1")
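
A quick way to check the difference between the two forms of wc, sketched with a throwaway three-line file. The helper count_args is hypothetical and only reports how many arguments an unquoted expansion produces.

```shell
#!/bin/bash
tmpfile=$(mktemp)
printf 'a\nb\nc\n' > "$tmpfile"

with_name=$(wc -l "$tmpfile")       # count followed by the file name
count_only=$(wc -l < "$tmpfile")    # count only (some platforms pad it with spaces)

# Hypothetical helper: print how many arguments it received
count_args() { echo "$#"; }

args_with_name=$(count_args -s $with_name)     # -s, count, file name
args_count_only=$(count_args -s $count_only)   # -s, count

printf 'wc -l FILE   -> "%s" (%s arguments including -s)\n' "$with_name" "$args_with_name"
printf 'wc -l < FILE -> "%s" (%s arguments including -s)\n' "$count_only" "$args_count_only"
rm -f "$tmpfile"
```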

Here are some minor improvements:

< $TMPFILE1 colrm 32 42 |                   #1 No need for cat

or

colrm 32 42 < $TMPFILE1 |                   #1

< $TMPFILE1 cut -c1-31,43- |                #2

or

cut -c1-31,43- < $TMPFILE1 |                #2

sed 's/\.0000000000//' $TMPFILE1 |          #4 Escape the dot, and quote the expression so the shell keeps the backslash
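
Why the escape matters: an unescaped dot matches any character, so the pattern can hit something other than the fractional seconds. The sketch below uses a made-up line for a file of exactly 10000000000 bytes, where the size field itself is one character followed by ten zeros. It also shows that the backslash has to be quoted, because the shell removes an unquoted one before sed ever sees it.

```shell
#!/bin/bash
# Made-up sample: an 11-digit size, then date, time and path
line=$(printf '%11s %s %s %s' 10000000000 2011-01-19 22:43:51.0000000000 ./big.img)

# Unescaped dot: the first match is the size field, which gets deleted
unescaped=$(printf '%s\n' "$line" | sed 's/.0000000000//')

# Escaped and quoted: only the literal ".0000000000" is removed
escaped=$(printf '%s\n' "$line" | sed 's/\.0000000000//')

# Unquoted, the shell strips the backslash before sed runs
shell_sees=$(printf '%s' s/\.0000000000//)

printf 'unescaped: %s\n' "$unescaped"
printf 'escaped:   %s\n' "$escaped"
printf 'unquoted expression as passed to sed: %s\n' "$shell_sees"
```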
绮筵 2024-10-21 10:03:59

If this is an actual working tool, and not just a toy, then I'd just drop the "progress feedback" altogether... maybe come back to it when it doesn't complicate your life. In the meantime you've probably spent more time trying to figure out how to give feedback than you will ever spend waiting for your script to return.

If you absolutely MUST give some sort of feedback then just
echo "Sorting $(wc -l < "$TMPFILE") lines ..."

You'll get a feeling for how long it'll take to sort so many lines from experience.

KISS it, my son, KISS it.
