How can I sort CSV data alphabetically, then numerically by column?
If I have a set of data that has repeating name values but with different variations per repeating value, how can I sort by the top of each of those repeating values? Hopefully that made sense, but I hope to demonstrate what I mean further below.
Take for example this set of data in a tab separated csv file
Ranking  ID   Year  Make        Model           Total
1        128  2010  Infiniti    G37             128
2        124  2015  Jeep        Wrangler        124
3        15   014   Audi        S4              120
4        113  2012  Acura       Tsx sportwagon  116
5        83   2014  Honda       Accord          112
6        112  2008  Acura       TL              110
7        65   2009  Honda       Fit             106
8        91   2010  Mitsu       Lancer          102
9        50   2015  Acura       TLX             102
10       31   2007  Honda       Fit             102
11       216  2007  Chrystler   300             96
12       126  2010  Volkswagen  Eos             92
13       13   2016  Honda       Civic 1.5t      92
If you look in the Make column, you can see that names like Acura and Honda repeat, with different values in the Model and Total columns. Assume the csv file has 200 or so rows like this. How can I sort the file so that the rows are grouped by Make, with only the three rows that have the highest Total displayed for each Make?
Expected output below
Ranking  ID   Year  Make   Model           Total
1        113  2012  Acura  Tsx sportwagon  116
2        112  2008  Acura  TL              110
3        50   2015  Acura  TLX             106
4        83   2014  Honda  Accord          112
5        31   2007  Honda  Fit             102
6        13   2016  Honda  Civic 1.5t      92
...
Here is my awk code so far. I can't get past this part to even attempt grouping the makes by the Total column:
BEGIN {
    FS = OFS = "\t";
}
FNR == 1 {
    print;
    next;
}
FNR > 1 {
    a[NR] = $4;
}
END {
    PROCINFO["sorted_in"] = "@val_str_desc"
    for(i = 1; i < FN-1; i++) {
        print a[i];
    }
}
Currently, my code reads the text file, prints the header line (the column titles) and then stops there. It doesn't go on to print the rest of the data in alphabetical order. Any ideas?
Answers (2)
The following assumes bash (if you don't use bash, replace `$'\t'` with a quoted real tab character) and GNU coreutils. It also assumes that you want to sort alphabetically by the `Make` column first, then numerically in decreasing order by `Total`, and finally keep at most the first 3 entries of each `Make`.

Sorting is a job for `sort`; `head` and `tail` can be used to isolate the header line, and `awk` can be used to keep at most 3 entries of each `Make` and to renumber the first column.

Note that this is different from your expected output: `Make` is sorted in alphabetic order (`Audi` comes after `Acura`, not `Honda`) and only the 3 largest `Total` values are kept (`112, 106, 102` for `Honda`, not `112, 102, 92`).

If you use GNU `awk` and your input file is small enough to fit in memory, you can also do all this with just `awk`, thanks to its multidimensional arrays and its `asorti` function, which sorts arrays based on indices.
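The `sort`/`head`/`tail`/`awk` pipeline described in this answer could look roughly like the sketch below. It is not the answer's original code: the sample rows are a subset of the question's data, the file name `cars.tsv` and the function name `top3_pipeline` are made up for illustration, and `tab=$(printf '\t')` stands in for bash's `$'\t'` so the script also runs in a plain POSIX shell. Setting `LC_ALL=C` for `sort` is an added assumption that keeps the alphabetical order independent of the locale.

```shell
#!/bin/sh
# Sketch only: cars.tsv and top3_pipeline are hypothetical names.
workdir=$(mktemp -d)
infile="$workdir/cars.tsv"
tab=$(printf '\t')   # portable stand-in for the bash syntax $'\t'

# Tab-separated sample (subset of the question rows): Make is field 4, Total is field 6.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    Ranking ID Year Make Model Total \
    1 128 2010 Infiniti G37 128 \
    2 124 2015 Jeep Wrangler 124 \
    4 113 2012 Acura 'Tsx sportwagon' 116 \
    5 83 2014 Honda Accord 112 \
    6 112 2008 Acura TL 110 \
    7 65 2009 Honda Fit 106 \
    9 50 2015 Acura TLX 102 \
    10 31 2007 Honda Fit 102 \
    13 13 2016 Honda 'Civic 1.5t' 92 > "$infile"

top3_pipeline() {
    # Pass the header line through untouched.
    head -n 1 "$1"
    # Sort the data rows by Make (ascending), then by Total (numeric, descending);
    # awk then keeps at most 3 rows per Make and renumbers the Ranking column.
    tail -n +2 "$1" |
        LC_ALL=C sort -t "$tab" -k4,4 -k6,6nr |
        awk 'BEGIN { FS = OFS = "\t" }
             ++seen[$4] <= 3 { $1 = ++rank; print }'
}

top3_pipeline "$infile"
```

With this sample, the fourth Honda row (`Civic 1.5t`, total 92) is dropped and Honda keeps the totals 112, 106 and 102, which is the difference from the expected output that the answer points out.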
Using GNU awk for arrays of arrays and `sorted_in`.

This approach handles cases where multiple cars of one make have the same total by always printing just the top 3 rows for that make. For example, given input in which 4 Acuras all have a total of 116, the output shows only 3 of those 4 Acuras.

If that's not what you want, then move the `if ( ++cnt <= 3 )` test to the outer loop, or handle it however else you want.
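A GNU awk script along the lines described here might look like the following sketch. It is an illustration, not the answer's original code: the array names `rows` and `n`, the function name `top3_by_make`, the file name `cars.tsv`, and the sample rows (four hypothetical Acuras tied at 116 plus two Hondas) are all assumptions. It relies on gawk 4.0 or later for arrays of arrays and `PROCINFO["sorted_in"]`.

```shell
#!/bin/sh
# Sketch only; requires GNU awk (gawk). Names and sample data are hypothetical.
workdir=$(mktemp -d)
infile="$workdir/cars.tsv"

# Four Acuras tied at 116, so only three of them should survive.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    Ranking ID Year Make Model Total \
    1 83 2014 Honda Accord 112 \
    2 201 2012 Acura A 116 \
    3 202 2012 Acura B 116 \
    4 65 2009 Honda Fit 106 \
    5 203 2012 Acura C 116 \
    6 204 2012 Acura D 116 > "$infile"

top3_by_make() {
    gawk '
    BEGIN { FS = OFS = "\t" }
    NR == 1 { print; next }                  # header passes through
    { rows[$4][$6][++n[$4,$6]] = $0 }        # make -> total -> rows in input order
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"        # makes: alphabetical
        for (make in rows) {
            PROCINFO["sorted_in"] = "@ind_num_desc"   # totals: largest first
            cnt = 0
            for (total in rows[make]) {
                for (i = 1; i <= n[make,total]; i++) {
                    if ( ++cnt <= 3 ) {      # at most 3 rows per make, ties included
                        $0 = rows[make][total][i]
                        $1 = ++rank          # renumber the Ranking column
                        print
                    }
                }
            }
        }
    }' "$1"
}

top3_by_make "$infile"
```

In this sketch, moving the `++cnt <= 3` test up into the `for (total in rows[make])` loop would limit the number of distinct totals per make instead, so all four tied Acuras would be printed.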