Get a count of unique values in a column in bash


I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?

It can use any common command line language like awk, perl, python etc.

7 Answers

时光是把杀猪刀 2024-10-23 12:42:09

To see a frequency count for column two (for example):

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr

fileA.txt

z    z    a
a    b    c
w    d    e

fileB.txt

t    r    e
z    d    a
a    g    c

fileC.txt

z    r    a
v    d    c
a    m    c

Result:

  3 d
  2 r
  1 z
  1 m
  1 g
  1 b
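
As an aside, the same count can be produced in a single awk pass, leaving only the final sort (a sketch, not part of the original answer): awk extracts column two from every file and tallies values in an array, and sort -nr orders the output by count. The output is equivalent up to leading whitespace.

awk -F '\t' '{ count[$2]++ } END { for (v in count) print count[v], v }' * | sort -nr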
千纸鹤 2024-10-23 12:42:09

Here is a way to do it in the shell:

FIELD=2
cut -f "$FIELD" * | sort | uniq -c | sort -nr

This is the sort of thing bash is great at.
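
If you use this often, it can be wrapped in a small function (a sketch; the name count_field is mine, not part of the answer):

count_field() {
    # count_field COLUMN FILE...  -- frequency of values in COLUMN, highest count first
    cut -f "$1" "${@:2}" | sort | uniq -c | sort -nr
}
count_field 2 *.txt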

想念有你 2024-10-23 12:42:09

The GNU site suggests this nice awk script, which prints both the words and their frequency.

Possible changes:

  • You can pipe through sort -nr (and swap word and freq[word] in the printf) to see the result in descending order.
  • If you want a specific column, you can omit the for loop and simply write freq[$3]++, replacing 3 with the column number. Both tweaks are sketched after the script below.

Here goes:

 # wordfreq.awk --- print list of word frequencies
 
 {
     $0 = tolower($0)    # remove case distinctions
     # remove punctuation
     gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
     for (i = 1; i <= NF; i++)
         freq[$i]++
 }
 
 END {
     for (word in freq)
         printf "%s\t%d\n", word, freq[word]
 }
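
Putting the two suggestions together, a hypothetical variant that counts only column three of tab-separated input and prints the highest counts first (it skips the lowercasing and punctuation stripping of the original script):

awk -F '\t' '{ freq[$3]++ } END { for (word in freq) printf "%d\t%s\n", freq[word], word }' * | sort -nr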
厌倦 2024-10-23 12:42:09

Perl

This code counts the occurrences of the values in every column and prints a sorted report for each:

# columnvalues.pl
while (<>) {
    @Fields = split /\s+/;
    for $i ( 0 .. $#Fields ) {
        $result[$i]{$Fields[$i]}++
    };
}
for $j ( 0 .. $#result ) {
    print "column $j:\n";
    @values = keys %{$result[$j]};
    @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a}  ||  $a cmp $b } @values;
    for $k ( @sorted ) {
        print " $k $result[$j]{$k}\n"
    }
}

Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*

Explanation

In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure

In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence

Results based on the sample input files provided by @Dennis

column 0:
 a 3
 z 3
 t 1
 v 1
 w 1
column 1:
 d 3
 r 2
 b 1
 g 1
 m 1
 z 1
column 2:
 c 4
 a 3
 e 2

.csv input

If your input files are .csv, change /\s+/ to /,/ (note that a bare /,/ split will not handle quoted fields that themselves contain commas).

Obfuscation

In an ugly-code contest, Perl is particularly well equipped.
This one-liner does the same:

perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*
一杯敬自由 2024-10-23 12:42:09

Ruby (1.9+)

#!/usr/bin/env ruby
# For each file in the current directory, count every tab-separated value
# (note: all columns, not just one) and print "value:count" pairs,
# highest count first.
Dir["*"].each do |file|
    h = Hash.new(0)                    # default count of 0 for unseen values
    open(file).each do |row|
        row.chomp.split("\t").each do |w|
            h[w] += 1
        end
    end
    h.sort { |a, b| b[1] <=> a[1] }.each { |x, y| print "#{x}:#{y}\n" }
end
木緿 2024-10-23 12:42:09

Here is a tricky one approaching linear time (but probably not faster!) by avoiding sort and uniq, except for the final sort. It is based on tee and wc instead: each distinct column value becomes a file name, and the single blank line from echo is appended by tee -a once per mention of that file name, so a value occurring three times gets three lines in its file; wc -l then reports the per-value line counts. (Values containing / or other characters that are not valid in file names will break it.)

$ FIELD=2
$ values="$(cut -f $FIELD *)"
$ mkdir /tmp/counts
$ cd /tmp/counts
$ echo | tee -a $values
$ wc -l * | sort -nr
9 total
3 d
2 r
1 z
1 m
1 g
1 b
$
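
The same idea written a bit more defensively (a sketch; the mktemp temp directory and the cleanup are my additions, not part of the original answer):

FIELD=2
values=$(cut -f "$FIELD" *)    # grab the column before changing directory
workdir=$(mktemp -d)           # throwaway directory for the counter files
(
  cd "$workdir" || exit
  for v in $values; do         # word splitting is intentional here
      echo >> "$v"             # append one blank line per occurrence
  done
  wc -l * | sort -nr           # line count per file = count per value
)
rm -r "$workdir"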
情愿 2024-10-23 12:42:09

Pure-Bash version:

FIELD=1    # zero-based index: 1 selects the second column
declare -A results
while read -r -a line; do
    results[${line[$FIELD]:-(empty)}]=$(( results[${line[$FIELD]:-(empty)}] + 1 ))
done < file.txt
echo "${results[@]@A}"

The key logic is to fill an associative array whose keys are the values found in the file and whose values are the number of occurrences:

  • $FIELD is the selected column number (zero-based, so 1 is the second column)
  • ${line[$FIELD]} is the column value from that line in the file
  • ${...:-(empty)} is a special case for empty values (what happens if there are fewer columns than expected?)

To have the output sorted in the format the OP expects, a little more work is needed:

sort -rn < <(
    for k in "${!results[@]}"; do
        echo "${results[$k]} $k";
    done
)

Warning: it works well for tab-delimited and space-delimited files, but breaks for values that contain spaces.
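
The snippet above reads a single file.txt, while the question asks about every file in a folder. A minimal sketch (my combination of the two pieces above, assuming the files are passed as arguments) that aggregates across several files:

#!/usr/bin/env bash
# usage: ./script.sh FILE...   (FIELD is the zero-based column index)
FIELD=1
declare -A results
for f in "$@"; do
    while read -r -a line; do
        results[${line[$FIELD]:-(empty)}]=$(( results[${line[$FIELD]:-(empty)}] + 1 ))
    done < "$f"
done
for k in "${!results[@]}"; do
    printf '%s %s\n' "${results[$k]}" "$k"
done | sort -rn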
