Get a count of unique values in a column in bash


I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?

It can use any common command line language like awk, perl, python etc.

7 Answers

时光是把杀猪刀 2024-10-23 12:42:09

To see a frequency count for column two (for example):

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr

fileA.txt

z    z    a
a    b    c
w    d    e

fileB.txt

t    r    e
z    d    a
a    g    c

fileC.txt

z    r    a
v    d    c
a    m    c

Result:

  3 d
  2 r
  1 z
  1 m
  1 g
  1 b
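
As an aside, the same count can be produced in a single awk pass, leaving only the final sort (a sketch, not part of the original answer): awk extracts column two from every file and tallies values in an array, and sort -nr orders the output by count. The output is equivalent up to leading whitespace.

awk -F '\t' '{ count[$2]++ } END { for (v in count) print count[v], v }' * | sort -nr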
千纸鹤 2024-10-23 12:42:09

Here is a way to do it in the shell:

FIELD=2
cut -f "$FIELD" * | sort | uniq -c | sort -nr

This is the sort of thing bash is great at.
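
If you use this often, it can be wrapped in a small function (a sketch; the name count_field is mine, not part of the answer):

count_field() {
    # count_field COLUMN FILE...  -- frequency of values in COLUMN, highest count first
    cut -f "$1" "${@:2}" | sort | uniq -c | sort -nr
}
count_field 2 *.txt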

想念有你 2024-10-23 12:42:09

The GNU site suggests this nice awk script, which prints both the words and their frequency.

Possible changes:

  • You can pipe through sort -nr (and swap word and freq[word] in the printf) to see the result in descending order.
  • If you want a specific column, you can omit the for loop and simply write freq[$3]++, replacing 3 with the column number. Both tweaks are sketched after the script below.

Here goes:

 # wordfreq.awk --- print list of word frequencies
 
 {
     $0 = tolower($0)    # remove case distinctions
     # remove punctuation
     gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
     for (i = 1; i <= NF; i++)
         freq[$i]++
 }
 
 END {
     for (word in freq)
         printf "%s\t%d\n", word, freq[word]
 }
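
Putting the two suggestions together, a hypothetical variant that counts only column three of tab-separated input and prints the highest counts first (it skips the lowercasing and punctuation stripping of the original script):

awk -F '\t' '{ freq[$3]++ } END { for (word in freq) printf "%d\t%s\n", freq[word], word }' * | sort -nr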
厌倦 2024-10-23 12:42:09

Perl

This code counts the occurrences of the values in every column and prints a sorted report for each:

# columnvalues.pl
while (<>) {
    @Fields = split /\s+/;
    for $i ( 0 .. $#Fields ) {
        $result[$i]{$Fields[$i]}++
    };
}
for $j ( 0 .. $#result ) {
    print "column $j:\n";
    @values = keys %{$result[$j]};
    @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a}  ||  $a cmp $b } @values;
    for $k ( @sorted ) {
        print " $k $result[$j]{$k}\n"
    }
}

Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*

Explanation

In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure

In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence

Results based on the sample input files provided by @Dennis

column 0:
 a 3
 z 3
 t 1
 v 1
 w 1
column 1:
 d 3
 r 2
 b 1
 g 1
 m 1
 z 1
column 2:
 c 4
 a 3
 e 2

.csv input

If your input files are .csv, change /\s+/ to /,/ (note that a bare /,/ split will not handle quoted fields that themselves contain commas).

Obfuscation

In an ugly-code contest, Perl is particularly well equipped.
This one-liner does the same:

perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*
一杯敬自由 2024-10-23 12:42:09

Ruby (1.9+)

#!/usr/bin/env ruby
# For each file in the current directory, count every tab-separated value
# (note: all columns, not just one) and print "value:count" pairs,
# highest count first.
Dir["*"].each do |file|
    h = Hash.new(0)                    # default count of 0 for unseen values
    open(file).each do |row|
        row.chomp.split("\t").each do |w|
            h[w] += 1
        end
    end
    h.sort { |a, b| b[1] <=> a[1] }.each { |x, y| print "#{x}:#{y}\n" }
end
木緿 2024-10-23 12:42:09

Here is a tricky one approaching linear time (but probably not faster!) by avoiding sort and uniq, except for the final sort. It is based on tee and wc instead: each distinct column value becomes a file name, and the single blank line from echo is appended by tee -a once per mention of that file name, so a value occurring three times gets three lines in its file; wc -l then reports the per-value line counts. (Values containing / or other characters that are not valid in file names will break it.)

$ FIELD=2
$ values="$(cut -f $FIELD *)"
$ mkdir /tmp/counts
$ cd /tmp/counts
$ echo | tee -a $values
$ wc -l * | sort -nr
9 total
3 d
2 r
1 z
1 m
1 g
1 b
$
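
The same idea written a bit more defensively (a sketch; the mktemp temp directory and the cleanup are my additions, not part of the original answer):

FIELD=2
values=$(cut -f "$FIELD" *)    # grab the column before changing directory
workdir=$(mktemp -d)           # throwaway directory for the counter files
(
  cd "$workdir" || exit
  for v in $values; do         # word splitting is intentional here
      echo >> "$v"             # append one blank line per occurrence
  done
  wc -l * | sort -nr           # line count per file = count per value
)
rm -r "$workdir"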
情愿 2024-10-23 12:42:09

Pure-Bash version:

FIELD=1    # zero-based index: 1 selects the second column
declare -A results
while read -r -a line; do
    results[${line[$FIELD]:-(empty)}]=$(( results[${line[$FIELD]:-(empty)}] + 1 ))
done < file.txt
echo "${results[@]@A}"

The key logic is to fill an associative array whose keys are the values found in the file and whose values are the number of occurrences:

  • $FIELD is the selected column number (zero-based, so 1 is the second column)
  • ${line[$FIELD]} is the column value from that line in the file
  • ${...:-(empty)} is a special case for empty values (what happens if there are fewer columns than expected?)

To have the output sorted in the format the OP expects, a little more work is needed:

sort -rn < <(
    for k in "${!results[@]}"; do
        echo "${results[$k]} $k";
    done
)

Warning: it works well for tab-delimited and space-delimited files, but breaks for values that contain spaces.
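
The snippet above reads a single file.txt, while the question asks about every file in a folder. A minimal sketch (my combination of the two pieces above, assuming the files are passed as arguments) that aggregates across several files:

#!/usr/bin/env bash
# usage: ./script.sh FILE...   (FIELD is the zero-based column index)
FIELD=1
declare -A results
for f in "$@"; do
    while read -r -a line; do
        results[${line[$FIELD]:-(empty)}]=$(( results[${line[$FIELD]:-(empty)}] + 1 ))
    done < "$f"
done
for k in "${!results[@]}"; do
    printf '%s %s\n' "${results[$k]}" "$k"
done | sort -rn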
