awk - table of distinct values
Given this "|"-delimited file, dummy.dat:
sid|storeNo|latitude|longitude
2|1|-28.03720000
9|2
10
jgn352|1|-28.03720000
9|2|fdjkjhn422-405
0000543210|gfdjk39
For example, the value "-28.03720000" appears twice in the latitude field, so in the output it should appear once, followed by "(2)". As another example, the value "2" appears once in the sid field but twice in the storeNo field, so the output should have one entry under sid (ending in "(1)") and one entry under storeNo (ending in "(2)").
Desired result:
sid|storeNo|latitude|longitude
9(2)|1(2)|-28.03720000(2)
0000543210(1)|2(2)|fdjkjhn422-405(1)
10(1)|gfdjk39(1)
2(1)
jgn352(1)
Another acceptable result for the same input file (values with equal counts may appear in any order):
sid|storeNo|latitude|longitude
9(2)|2(2)|-28.03720000(2)
jgn352(1)|1(2)|fdjkjhn422-405(1)
10(1)|gfdjk39(1)
0000543210(1)
2(1)
What is a generic solution that produces output like the above? I am open to awk, bash, perl, etc.
The output is the distinct values of each field, each followed by its number of occurrences in "()", ordered by that count in descending order.
I have found these two code snippets, which get the general idea but produce a different output format:
Script 1:
awk -F"|" '
{
    for (i = 1; i <= NF; i++) {
        count[i " " $(i)]++;    # count by field number and field value
        uniq[$(i)] = 1;         # save a list of unique strings
    }
    if (NF > fields)
        fields = NF;            # in case a variable number in file; capture max
}
END {
    for (i = 1; i <= fields; i++) {
        printf("field %d\n", i);
        for (x in uniq)
            if (count[i " " x])
                printf("%s (%d)\n", x, count[i " " x]);    # print by field and value
        printf("\n");
    }
}' dummy.dat
Script 2:
awk -F"|" '{for (i=1;i<=NF;i++) a[i FS $i]++} END {for (i in a) print i,"(",a[i],")" |"sort -n" } ' dummy.dat
Output
Note:
Getting rid of the trailing | takes some extra work. Hopefully this will suffice: I just piped the final output through sed 's/|*$//'.
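For reference, here is a minimal sketch of one way to get the layout asked for in the question, using only POSIX awk, sort, head and tail. It is an illustration under a few assumptions of mine, not a definitive solution: the function name distinct_table is hypothetical, the first line is treated as a header and not counted, empty values are skipped, and values with equal counts are ordered by byte value so the result is repeatable. Trailing pipes are stripped inside awk with sub(), which has the same effect as piping through sed 's/|*$//'.

```shell
#!/bin/sh
# Recreate the sample input from the question.
cat > dummy.dat <<'EOF'
sid|storeNo|latitude|longitude
2|1|-28.03720000
9|2
10
jgn352|1|-28.03720000
9|2|fdjkjhn422-405
0000543210|gfdjk39
EOF

# distinct_table FILE  (hypothetical helper name)
# Prints the header, then one column per field listing each distinct value
# with its count, most frequent first.
distinct_table() {
    head -n 1 "$1"                  # header row passes through unchanged
    tail -n +2 "$1" |
    awk -F'|' '
        # Count each (field number, value) pair; empty values are skipped (assumption).
        { for (i = 1; i <= NF; i++) if ($i != "") count[i FS $i]++ }
        # Emit one record per distinct pair as: field|count|value
        END {
            for (k in count) {
                split(k, part, FS)
                print part[1] FS count[k] FS part[2]
            }
        }
    ' |
    # Order by field number, then count descending, then value (byte order) for ties.
    LC_ALL=C sort -t'|' -k1,1n -k2,2nr -k3,3 |
    awk -F'|' '
        {
            r = ++rows[$1]                  # rank of this value within its field
            cell[r, $1] = $3 "(" $2 ")"
            if ($1 + 0 > nf) nf = $1 + 0    # highest field number that has data
            if (r > nr) nr = r              # longest column
        }
        END {
            for (r = 1; r <= nr; r++) {
                line = ""
                for (f = 1; f <= nf; f++)
                    line = line (f > 1 ? "|" : "") cell[r, f]
                sub(/\|+$/, "", line)       # drop trailing pipes left by empty cells
                print line
            }
        }
    '
}

distinct_table dummy.dat
```

On the sample file this prints the first "Desired result" listing from the question. Dropping the third sort key gives one of the other tie orderings, which the question also accepts.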