Strange behavior when counting occurrences with awk
I need to count the number of occurrences of elements of the second column of a large number of files. The script I'm using is this:
{
    # tally each value seen in the second column
    el[$2]++
}
END {
    # after all input is read, append the counts to rank.txt
    for (i in el) {
        print i, el[i] >> "rank.txt"
    }
}
To run it over a large number of files I'm using find | xargs, like this:
find . -name "*.txt" | xargs awk -f script.awk
The problem is that if I count the number of lines of the output file rank.txt (with wc -l rank.txt), the number I get (for example 7600) is bigger than the number of unique elements in the second column (for example 7300), which I obtain with:
find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq | wc -l
In fact, running:
awk '{print $1}' rank.txt | sort | uniq | wc -l
I obtain the right number of elements (following the example, I get 7300). So it means that the elements in the first column of the output file are not unique. But this shouldn't happen!
Comments (2)
This is probably a combination of two things: the input files (*.txt) contain non-unique elements, and the way xargs works. Remember that when there is a large number of files, xargs invokes the command repeatedly, each time with a different set of arguments. This means that in the first example, if the number of files is large enough, not all files are processed in a single awk run; each run then appends its own END-block output to rank.txt, which results in a higher number of "unique" elements in the output.
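You can see this batching directly. For instance, this quick check (using only standard seq, xargs, and wc; the exact number of batches depends on your system's argument-length limit) prints how many times xargs invoked echo:

# One output line per echo invocation; a value greater than 1
# means xargs split the arguments across several runs.
seq 1000000 | xargs echo | wc -l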
You could try this:
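For instance, you could stream the contents of all the files through a single awk process, so that END runs exactly once over everything (a sketch; -print0/-0 assumes your find and xargs support them, as the GNU and BSD versions do):

# Concatenate every matching file and feed the combined stream
# to one awk invocation, so only one END block writes to rank.txt.
find . -name "*.txt" -print0 | xargs -0 cat | awk -f script.awk

Also note that the script appends with >>, so delete any old rank.txt first (or print to standard output and redirect the whole run with > rank.txt); otherwise results from previous runs pile up.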
You can find out where the duplicates in $1 are by using sort and uniq -c.
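For instance, something along these lines (a sketch; uniq -c prefixes each line with its count, which the final awk then filters on):

# List the keys in rank.txt's first column that appear more than once.
awk '{print $1}' rank.txt | sort | uniq -c | awk '$1 > 1'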
I don't have a way to test this right now; the intent of the last awk is to filter the output of uniq -c to show only records that have a count greater than one. I hope this helps.