Strange behavior when counting occurrences with awk
I need to count the number of occurrences of elements of the second column of a large number of files. The script I'm using is this:
{
    # tally each value seen in the second column
    el[$2]++
}
END {
    # after all input is read, append the counts to rank.txt
    for (i in el) {
        print i, el[i] >> "rank.txt"
    }
}
To run it over a large number of files I'm using find | xargs, like this:
find . -name "*.txt" | xargs awk -f script.awk
The problem is that if I count the number of lines of the output file rank.txt (with wc -l rank.txt), the number I get (for example 7600) is bigger than the number of unique elements in the second column (for example 7300), which I obtain with:
find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq | wc -l
In fact, running:
awk '{print $1}' rank.txt | sort | uniq | wc -l
I obtain the right number of elements (following the example, I get 7300). So it means that the elements in the first column of the output file are not unique. But this shouldn't happen!
Comments (2)
This is probably a combination of two things: the input files (*.txt) contain non-unique elements, and the way xargs works. Remember that when there is a large number of files, xargs invokes the command repeatedly, each time with a different set of arguments. This means that in the first example, if the number of files is large enough, not all files are processed in a single awk run; each run then appends its own END-block output to rank.txt, which results in a higher number of "unique" elements in the output.
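You can see this batching directly. For instance, this quick check (using only standard seq, xargs, and wc; the exact number of batches depends on your system's argument-length limit) prints how many times xargs invoked echo:

# One output line per echo invocation; a value greater than 1
# means xargs split the arguments across several runs.
seq 1000000 | xargs echo | wc -l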
You could try this:
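For instance, you could stream the contents of all the files through a single awk process, so that END runs exactly once over everything (a sketch; -print0/-0 assumes your find and xargs support them, as the GNU and BSD versions do):

# Concatenate every matching file and feed the combined stream
# to one awk invocation, so only one END block writes to rank.txt.
find . -name "*.txt" -print0 | xargs -0 cat | awk -f script.awk

Also note that the script appends with >>, so delete any old rank.txt first (or print to standard output and redirect the whole run with > rank.txt); otherwise results from previous runs pile up.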
You can find out where the duplicates in $1 are by using sort and uniq -c.
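For instance, something along these lines (a sketch; uniq -c prefixes each line with its count, which the final awk then filters on):

# List the keys in rank.txt's first column that appear more than once.
awk '{print $1}' rank.txt | sort | uniq -c | awk '$1 > 1'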
I don't have a way to test this right now; the intent of the last awk is to filter the output of uniq -c to show only records that have a count greater than one. I hope this helps.