awk - table of distinct values

Posted on 2024-12-23 00:49:10

Given this "|"-delimited file, dummy.dat:

sid|storeNo|latitude|longitude
2|1|-28.03720000
9|2
10
jgn352|1|-28.03720000
9|2|fdjkjhn422-405
0000543210|gfdjk39

For example, the value "-28.03720000" appears twice in the latitude field, so in the output it appears once, followed by "(2)". As another example, the value "2" appears once in the sid field but twice in the storeNo field, so the output has one entry under sid (ending in "(1)") and one entry under storeNo (ending in "(2)").

Desired result:

sid|storeNo|latitude|longitude
9(2)|1(2)|-28.03720000(2)
0000543210(1)|2(2)|fdjkjhn422-405(1)
10(1)|gfdjk39(1)    
2(1)
jgn352(1)

Another example of acceptable desired result (given the same input file):

sid|storeNo|latitude|longitude
9(2)|2(2)|-28.03720000(2)
jgn352(1)|1(2)|fdjkjhn422-405(1)
10(1)|gfdjk39(1)    
0000543210(1)
2(1)

What is a generic solution that produces output like the above? I am open to awk, bash, perl, etc.
The output should list the distinct values of each field, each followed by its number of occurrences in "()", ordered by descending count.

I have found these two code snippets, which capture the general idea but produce a different output format:

Script 1:
awk -F"|" ' {
                for( i = 1; i <= NF; i++ )
                {
                        count[i " " $(i)]++;    # count by field number and field value
                        uniq[$(i)] = 1;         # save a list of unique strings
                }
                if( NF > fields )
                        fields = NF;            # in case a variable number in file; capture max
        }
        END {
                for( i = 1; i <= fields; i++ )
                {
                        printf( "field %d\n", i );
                        for( x in uniq )
                                if( count[i " " x] )
                                        printf( "%s (%d)\n", x, count[i " " x] );  # print by field and value
                        printf( "\n" );
                }
        } ' dummy.dat

Script 2:
awk -F"|" '{for (i=1;i<=NF;i++) a[i FS $i]++} END {for (i in a) print i,"(",a[i],")" |"sort -n" } ' dummy.dat


Comments (1)

少年亿悲伤 2024-12-30 00:49:10
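awk -F'|' '
# First pass (FNR==NR): count each value per column, skipping the header.
FNR==NR{
  if(FNR>1)
    for(i=1;i<=NF;i++)
      a[$i,i]++
  next
}
# Second pass: print the header unchanged.
FNR==1{print}
# Second pass: print value(count) the first time a value shows up in a column,
# and an empty cell for every repeat.
FNR>1{
  for(j=1;j<=NF;j++)
    if(b[$j,j]++)
      printf("|")
    else
      printf("%s(%s)|",$j,a[$j,j])
  print ""
}' ./dummy.dat ./dummy.dat | sed 's/|*$//'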

Output

sid|storeNo|latitude|longitude
2(1)|1(2)|-28.03720000(2)
9(2)|2(2)
10(1)
jgn352(1)
||fdjkjhn422-405(1)
0000543210(1)|gfdjk39(1)

Note: The awk script leaves trailing | characters on each line, and removing them inside awk would take some extra work, so I just pipe the final output through sed 's/|*$//'. Hopefully this will suffice.
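
The output above keeps each value on the row where it first appeared, so the columns have gaps and are not ordered by count. A minimal sketch of one way to get closer to the layout in the question (each column compacted and sorted by descending count) is shown below. It is only a sketch under a few assumptions: the first line is a header, empty cells are ignored, ties keep their order of first appearance (the question's two acceptable outputs suggest tie order does not matter), and mktemp is available. The names count, val, n, maxnf, rows, input and result are my own, not from the question.

```shell
#!/bin/sh
# Sample input from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
sid|storeNo|latitude|longitude
2|1|-28.03720000
9|2
10
jgn352|1|-28.03720000
9|2|fdjkjhn422-405
0000543210|gfdjk39
EOF

result=$(awk -F'|' '
NR == 1 { print; next }                 # pass the header through unchanged
{
    for (i = 1; i <= NF; i++) {
        if ($i == "") continue          # assumption: empty cells are ignored
        if (!((i, $i) in count))        # first time this value is seen in column i
            val[i, ++n[i]] = $i         # remember it in order of first appearance
        count[i, $i]++
    }
    if (NF > maxnf) maxnf = NF          # widest data row decides the column count
}
END {
    for (i = 1; i <= maxnf; i++) {
        # Insertion sort of column i by descending count; ties keep their order.
        for (a = 2; a <= n[i]; a++) {
            v = val[i, a]
            for (b = a - 1; b >= 1 && count[i, val[i, b]] < count[i, v]; b--)
                val[i, b + 1] = val[i, b]
            val[i, b + 1] = v
        }
        if (n[i] > rows) rows = n[i]
    }
    for (r = 1; r <= rows; r++) {
        line = ""
        for (i = 1; i <= maxnf; i++) {
            cell = (r <= n[i]) ? val[i, r] "(" count[i, val[i, r]] ")" : ""
            line = line (i > 1 ? "|" : "") cell
        }
        sub(/\|+$/, "", line)           # drop trailing empty columns
        print line
    }
}' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

This uses only POSIX awk features, so it does not depend on gawk's sorting functions. With the sample file the first data row comes out as 9(2)|1(2)|-28.03720000(2), and the remaining single-count values follow in the order they first appear in the file.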
