Get a count of unique values in a column in bash
I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?
It can use any common command line language like awk, perl, python etc.
7 Answers
To see a frequency count for column two (for example):
fileA.txt
fileB.txt
fileC.txt
Result:
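The answer's command and the contents of the sample files are missing above. A minimal, self-contained sketch of the standard `cut | sort | uniq -c | sort -nr` approach, using made-up data:

```shell
# Create three small tab-separated sample files (hypothetical contents).
mkdir -p /tmp/colcount && cd /tmp/colcount
printf 'x\tcat\ny\tdog\nz\tcat\n' > fileA.txt
printf 'p\tcat\nq\tbird\n'        > fileB.txt
printf 'r\tdog\ns\tcat\n'         > fileC.txt

# Extract column 2 of every file, then count and rank the distinct values.
cut -f2 -- *.txt | sort | uniq -c | sort -nr
```

`uniq -c` only counts adjacent duplicates, which is why the first `sort` is required; the final `sort -nr` orders the counts highest first.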
Here is a way to do it in the shell:
This is the sort of thing bash is great at.
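The snippet itself is missing; one plausible shape for it, with the column number kept in a variable (`FIELD` is an assumed name, and the input file is made up):

```shell
FIELD=2                                              # column to count
printf 'a\tred\nb\tblue\nc\tred\n' > /tmp/sample.tsv # hypothetical input

# Same pipeline idea, parameterized on the column number.
cut -f "$FIELD" /tmp/sample.tsv | sort | uniq -c | sort -rn
```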
The GNU site suggests this nice awk script, which prints both the words and their frequency.
Possible changes:
* Pipe through `sort -nr` (and reverse `word` and `freq[word]`) to see the result in descending order.
* `freq[3]++` - replace 3 with the column number.
Here goes:
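The awk script itself is missing above. The word-frequency program in the GNU awk manual looks roughly like the following, adapted here (as the answer suggests) to count one column and piped through `sort -nr`:

```shell
printf 'a\tcat\nb\tdog\nc\tcat\n' > /tmp/words.tsv   # hypothetical input

# freq[$2]++ counts column 2; replace 2 with the column you need.
awk 'BEGIN { FS = "\t" }
     { freq[$2]++ }
     END { for (word in freq) print freq[word], word }' /tmp/words.tsv |
  sort -nr
```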
Perl
This code computes the occurrences of all columns, and prints a sorted report for each of them:
Save the text as columnvalues.pl
Run it as:
perl columnvalues.pl files*
Explanation
In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure
In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
Results based on the sample input files provided by @Dennis
.csv input
If your input files are .csv, change `/\s+/` to `/,/`.
Obfuscation
In an ugly contest, Perl is particularly well equipped.
This one-liner does the same:
Ruby (1.9+)
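The Ruby snippet is missing here; a rough equivalent counting column two of tab-separated lines (the sample data is made up, and reading from real files via ARGF is left out for brevity):

```ruby
# Count occurrences of the second column and report by descending count.
lines = ["a\tcat", "b\tdog", "c\tcat"]   # stand-in for real file input

counts = Hash.new(0)                     # default count of 0 for new keys
lines.each { |line| counts[line.split("\t")[1]] += 1 }

sorted = counts.sort_by { |value, n| -n }
sorted.each { |value, n| puts "#{n}\t#{value}" }
```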
Here is a tricky one approaching linear time (but probably not faster!) by avoiding `sort` and `uniq`, except for the final sort. It is based on... `tee` and `wc` instead!
Pure-Bash version:
The key logic is to fill an associative array whose keys are the values found in the file and whose values are the number of occurrences:
* `$FIELD` is the selected column number
* `${line[$FIELD]}` is the column value from that line in the file
* `${...:-(empty)}` is a special case for empty values (what happens if there are fewer columns than expected?)
To have the output sorted in the expected OP format, a little more work is needed:
Warning: it works well for tab-delimited and space-delimited files, but works poorly for values with spaces in them.
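The code blocks for this answer are missing; a sketch of the approach described above, including the extra sorting step (requires bash 4+ for associative arrays; input data and the 1-based `FIELD` handling are assumptions):

```shell
#!/usr/bin/env bash
FIELD=2                                  # selected column, 1-based
declare -A freq                          # value -> number of occurrences

while IFS=$'\t' read -r -a line; do
  value=${line[FIELD-1]:-(empty)}        # "(empty)" when the column is missing
  freq[$value]=$(( ${freq[$value]:-0} + 1 ))
done < <(printf 'a\tcat\nb\tdog\nc\tcat\n')   # stand-in for the real input

# The extra work for OP-style ordering: emit "count<TAB>value" and sort.
for value in "${!freq[@]}"; do
  printf '%s\t%s\n' "${freq[$value]}" "$value"
done | sort -nr
```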