如何按字母顺序对 csv 数据进行排序，然后按列按数字排序？

发布于 2025-01-15 02:02:59 字数 1758 浏览 2 评论 0原文

如果我有一组具有重复名称值但每个重复值具有不同变体的数据，如何按每个重复值的顶部进行排序？希望这是有道理的，但我希望在下面进一步证明我的意思。

以制表符分隔的 csv 文件中的这组数据为例

Ranking  ID     Year  Make        Model     Total
1        128    2010  Infiniti    G37       128
2        124    2015  Jeep        Wrangler  124
3        15     014   Audi        S4        120
4        113    2012  Acura       Tsx       sportwagon  116
5        83     2014  Honda       Accord    112
6        112    2008  Acura       TL        110
7        65     2009  Honda       Fit       106
8        91     2010  Mitsu       Lancer    102
9        50     2015  Acura       TLX       102
10       31     2007  Honda       Fit       102
11       216    2007  Chrystler   300       96
12       126    2010  Volkswagen  Eos       92
13       13     2016  Honda       Civic     1.5t        92

，如果您查看“品牌”列，您可以看到 Acura 和 Honda 等名称重复，但“型号”和“总计”列中存在差异。假设 csv 文件中有 200 行左右。如何对文件进行排序，以便按品牌对项目进行分组，并且每个品牌仅显示“总计”列下值最高的三个项目？

下面

Ranking   ID      Year    Make     Model           Total
1         113     2012    Acura    Tsx sportwagon  116
2         112     2008    Acura    TL              110
3         50      2015    Acura    TLX             106
4         83      2014    Honda    Accord          112
5         31      2007    Honda    Fit             102
6         13      2016    Honda    Civic 1.5t      92
...

是我的 awk 代码，到目前为止，我什至无法尝试按总列对品牌进行分组

BEGIN {
        FS = OFS = "\t";
}
FNR == 1 {
        print;
        next;
}
FNR > 1 {
        a[NR] = $4;
}
END {
        PROCINFO["sorted_in"] = "@val_str_desc"
        for(i = 1; i < FN-1; i++) {
                print a[i];
        }
}

目前，我的代码读取文本文件，打印标题（列标题），然后停在那里，它不会继续按字母顺序打印其余数据。有什么想法吗？

原文

If I have a set of data that has repeating name values but with different variations per repeating value, how can I sort by the top of each of those repeating values? Hopefully that made sense, but I hope to demonstrate what I mean further below.

Take for example this set of data in a tab separated csv file

Ranking  ID     Year  Make        Model     Total
1        128    2010  Infiniti    G37       128
2        124    2015  Jeep        Wrangler  124
3        15     014   Audi        S4        120
4        113    2012  Acura       Tsx       sportwagon  116
5        83     2014  Honda       Accord    112
6        112    2008  Acura       TL        110
7        65     2009  Honda       Fit       106
8        91     2010  Mitsu       Lancer    102
9        50     2015  Acura       TLX       102
10       31     2007  Honda       Fit       102
11       216    2007  Chrystler   300       96
12       126    2010  Volkswagen  Eos       92
13       13     2016  Honda       Civic     1.5t        92

If you look in the Make column, you can see names like Acura and Honda repeat, with differences in the Model and Total column. Assume that there's 200 or so rows of this in the csv file. How can I sort the file so that the items are grouped by Make with only three of the highest in value under the Total column being displayed by each Make?

Expected output below

Ranking   ID      Year    Make     Model           Total
1         113     2012    Acura    Tsx sportwagon  116
2         112     2008    Acura    TL              110
3         50      2015    Acura    TLX             106
4         83      2014    Honda    Accord          112
5         31      2007    Honda    Fit             102
6         13      2016    Honda    Civic 1.5t      92
...

Here is my awk code so far, I can't get past this part to even attempt grouping the makes by total column

BEGIN {
        FS = OFS = "\t";
}
FNR == 1 {
        print;
        next;
}
FNR > 1 {
        a[NR] = $4;
}
END {
        PROCINFO["sorted_in"] = "@val_str_desc"
        for(i = 1; i < FN-1; i++) {
                print a[i];
        }
}

Currently, my code reads the text file, prints the headers (column titles) and then stops there, it doesn't go on to print out the rest of the data in alphabetical order. Any ideas?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

禾厶谷欠 2025-01-22 02:02:59

以下假设使用 bash（如果不使用 bash，请将 $'\t' 替换为带引号的实制表符）和 GNU coreutils。它还假设您希望首先按 Make 列按字母顺序排序，然后按 Total 按数字降序排序，最后最多保留每个 Make 的前 3 个 条目。

排序是sort的工作，head和tail可以用来隔离标题行，awk可以用于保留每个 Make 最多 3 个，并对第一列重新编号：

$ head -n1 data.tsv; tail -n+2 data.tsv | sort -t
请注意，这与您预期的输出不同：Make 按字母顺序排序（< code>Audi 排在 Acura 之后，而不是Honda），并且仅保留 3 个最大的 Total（Honda 为 112, 106, 102，而不是  >112, 102, 92)。
如果您使用 GNU awk，并且您的输入文件足够小，可以容纳在内存中，那么您也可以仅使用 awk 来完成所有这些操作，这要归功于它的多维数组及其 < code>asorti 函数，根据索引对数组进行排序：
$ awk -F'\t' -vOFS='\t' 'NR==1 {print; next} {l[$4][$6][$0]}
  END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for(m in l) {
      n = asorti(l[m], t, "@ind_num_desc"); n = (n>3) ? 3 : n
      for(i=1; i<=n; i++) for(s in l[m][t[i]]) {$0 = s; $1 = ++r; print}
    }
  }' data.tsv
Ranking  ID   Year  Make        Model           Total
1        113  2012  Acura       Tsx sportwagon  116
2        112  2008  Acura       TL              110
3        50   2015  Acura       TLX             102
4        15   014   Audi        S4              120
5        216  2007  Chrystler   300             96
6        83   2014  Honda       Accord          112
7        65   2009  Honda       Fit             106
8        31   2007  Honda       Fit             102
9        128  2010  Infiniti    G37             128
10       124  2015  Jeep        Wrangler        124
11       91   2010  Mitsu       Lancer          102
12       126  2010  Volkswagen  Eos             92

\t' -k4,4 -k6,6rn |
  awk -F'\t' -vOFS='\t' '$4==p {n+=1} $4!=p {n=1;p=$4} {$1=++r} n<=3'
Ranking  ID   Year  Make        Model           Total
1        113  2012  Acura       Tsx sportwagon  116
2        112  2008  Acura       TL              110
3        50   2015  Acura       TLX             102
4        15   014   Audi        S4              120
5        216  2007  Chrystler   300             96
6        83   2014  Honda       Accord          112
7        65   2009  Honda       Fit             106
8        31   2007  Honda       Fit             102
10       128  2010  Infiniti    G37             128
11       124  2015  Jeep        Wrangler        124
12       91   2010  Mitsu       Lancer          102
13       126  2010  Volkswagen  Eos             92

请注意，这与您预期的输出不同：Make 按字母顺序排序（< code>Audi 排在 Acura 之后，而不是Honda），并且仅保留 3 个最大的 Total（Honda 为 112, 106, 102，而不是 >112, 102, 92)。

如果您使用 GNU awk，并且您的输入文件足够小，可以容纳在内存中，那么您也可以仅使用 awk 来完成所有这些操作，这要归功于它的多维数组及其 < code>asorti 函数，根据索引对数组进行排序：

The following assumes bash (if you don't use bash replace $'\t' by a quoted real tab character) and GNU coreutils. It also assumes that you want to sort alphabetically by Make column first, then numerically in decreasing order by Total, and finally keep at most the first 3 of each Make entries.

Sorting is a job for sort, head and tail can be used to isolate the header line, and awk can be used to keep maximum 3 of each Make, and re-number the first column:

$ head -n1 data.tsv; tail -n+2 data.tsv | sort -t
Note that this is different from your expected output: Make is sorted in alphabetic order (Audi comes after Acura, not Honda) and only the 3 largest Total are kept (112, 106, 102 for Honda, not 112, 102, 92).
If you use GNU awk, and your input file is small enough to fit in memory, you can also do all this with just awk, thanks to its multidimensional arrays and its asorti function, that sorts arrays based on indices:
$ awk -F'\t' -vOFS='\t' 'NR==1 {print; next} {l[$4][$6][$0]}
  END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for(m in l) {
      n = asorti(l[m], t, "@ind_num_desc"); n = (n>3) ? 3 : n
      for(i=1; i<=n; i++) for(s in l[m][t[i]]) {$0 = s; $1 = ++r; print}
    }
  }' data.tsv
Ranking  ID   Year  Make        Model           Total
1        113  2012  Acura       Tsx sportwagon  116
2        112  2008  Acura       TL              110
3        50   2015  Acura       TLX             102
4        15   014   Audi        S4              120
5        216  2007  Chrystler   300             96
6        83   2014  Honda       Accord          112
7        65   2009  Honda       Fit             106
8        31   2007  Honda       Fit             102
9        128  2010  Infiniti    G37             128
10       124  2015  Jeep        Wrangler        124
11       91   2010  Mitsu       Lancer          102
12       126  2010  Volkswagen  Eos             92

\t' -k4,4 -k6,6rn |
  awk -F'\t' -vOFS='\t' '$4==p {n+=1} $4!=p {n=1;p=$4} {$1=++r} n<=3'
Ranking  ID   Year  Make        Model           Total
1        113  2012  Acura       Tsx sportwagon  116
2        112  2008  Acura       TL              110
3        50   2015  Acura       TLX             102
4        15   014   Audi        S4              120
5        216  2007  Chrystler   300             96
6        83   2014  Honda       Accord          112
7        65   2009  Honda       Fit             106
8        31   2007  Honda       Fit             102
10       128  2010  Infiniti    G37             128
11       124  2015  Jeep        Wrangler        124
12       91   2010  Mitsu       Lancer          102
13       126  2010  Volkswagen  Eos             92

Note that this is different from your expected output: Make is sorted in alphabetic order (Audi comes after Acura, not Honda) and only the 3 largest Total are kept (112, 106, 102 for Honda, not 112, 102, 92).

If you use GNU awk, and your input file is small enough to fit in memory, you can also do all this with just awk, thanks to its multidimensional arrays and its asorti function, that sorts arrays based on indices:

回复收藏 0 原文

可爱暴击 2025-01-22 02:02:59

使用 GNU awk 处理数组数组和 sorted_in：

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 {
    print
    next
}
{
    rows[$4][$6][++numRows[$4,$6]] = $0
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( make in rows ) {
        PROCINFO["sorted_in"] = "@ind_num_desc"
        cnt = 0
        for ( total in rows[make] ) {
            for ( rowNr=1; rowNr<=numRows[make,total]; rowNr++ ) {
                if ( ++cnt <= 3 ) {
                    row = rows[make][total][rowNr]
                    print row, cnt
                }
            }
        }
    }
}

$ awk -f tst.awk file
Ranking ID      Year    Make    Model   Total
4       113     2012    Acura   Tsx sportwagon  116     1
6       112     2008    Acura   TL      110     2
9       50      2015    Acura   TLX     102     3
3       15      014     Audi    S4      120     1
11      216     2007    Chrystler       300     96      1
5       83      2014    Honda   Accord  112     1
7       65      2009    Honda   Fit     106     2
10      31      2007    Honda   Fit     102     3
1       128     2010    Infiniti        G37     128     1
2       124     2015    Jeep    Wrangler        124     1
8       91      2010    Mitsu   Lancer  102     1
12      126     2010    Volkswagen      Eos     92      1

上面将处理 1 个品牌的多辆汽车具有相同总数的情况，方法是始终只打印该品牌的前 3 行，例如，鉴于此输入，其中 4 辆 Acuras 总共有 116 辆：

$ cat file
Ranking ID      Year    Make    Model   Total
1       128     2010    Infiniti        G37     128
2       124     2015    Jeep    Wrangler        124
3       15      014     Audi    S4      120
4       113     2012    Acura   Tsx sportwagon  116
4       113     2012    Acura   Foo     116
4       113     2012    Acura   Bar     116
4       113     2012    Acura   Other   116
5       83      2014    Honda   Accord  112
6       112     2008    Acura   TL      110
7       65      2009    Honda   Fit     106
8       91      2010    Mitsu   Lancer  102
9       50      2015    Acura   TLX     102
10      31      2007    Honda   Fit     102
11      216     2007    Chrystler       300     96
12      126     2010    Volkswagen      Eos     92
13      13      2016    Honda   Civic 1.5t      92

这是输出，仅显示这 4 116 辆汽车中的 3 辆Acuras：

$ awk -f tst.awk file
Ranking ID      Year    Make    Model   Total
4       113     2012    Acura   Tsx sportwagon  116     1
4       113     2012    Acura   Foo     116     2
4       113     2012    Acura   Bar     116     3
3       15      014     Audi    S4      120     1
11      216     2007    Chrystler       300     96      1
5       83      2014    Honda   Accord  112     1
7       65      2009    Honda   Fit     106     2
10      31      2007    Honda   Fit     102     3
1       128     2010    Infiniti        G37     128     1
2       124     2015    Jeep    Wrangler        124     1
8       91      2010    Mitsu   Lancer  102     1
12      126     2010    Volkswagen      Eos     92      1

如果这不是您想要的，则将 if ( ++cnt <= 3 ) 测试移至外循环或按您想要的其他方式处理它。

Using GNU awk for arrays of arrays and sorted_in:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 {
    print
    next
}
{
    rows[$4][$6][++numRows[$4,$6]] = $0
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( make in rows ) {
        PROCINFO["sorted_in"] = "@ind_num_desc"
        cnt = 0
        for ( total in rows[make] ) {
            for ( rowNr=1; rowNr<=numRows[make,total]; rowNr++ ) {
                if ( ++cnt <= 3 ) {
                    row = rows[make][total][rowNr]
                    print row, cnt
                }
            }
        }
    }
}

$ awk -f tst.awk file
Ranking ID      Year    Make    Model   Total
4       113     2012    Acura   Tsx sportwagon  116     1
6       112     2008    Acura   TL      110     2
9       50      2015    Acura   TLX     102     3
3       15      014     Audi    S4      120     1
11      216     2007    Chrystler       300     96      1
5       83      2014    Honda   Accord  112     1
7       65      2009    Honda   Fit     106     2
10      31      2007    Honda   Fit     102     3
1       128     2010    Infiniti        G37     128     1
2       124     2015    Jeep    Wrangler        124     1
8       91      2010    Mitsu   Lancer  102     1
12      126     2010    Volkswagen      Eos     92      1

The above will handle cases where multiple cars of 1 make have the same total by always just printing the top 3 rows for that make, e.g. gven this input where 4 Acuras all have 116 total:

$ cat file
Ranking ID      Year    Make    Model   Total
1       128     2010    Infiniti        G37     128
2       124     2015    Jeep    Wrangler        124
3       15      014     Audi    S4      120
4       113     2012    Acura   Tsx sportwagon  116
4       113     2012    Acura   Foo     116
4       113     2012    Acura   Bar     116
4       113     2012    Acura   Other   116
5       83      2014    Honda   Accord  112
6       112     2008    Acura   TL      110
7       65      2009    Honda   Fit     106
8       91      2010    Mitsu   Lancer  102
9       50      2015    Acura   TLX     102
10      31      2007    Honda   Fit     102
11      216     2007    Chrystler       300     96
12      126     2010    Volkswagen      Eos     92
13      13      2016    Honda   Civic 1.5t      92

this is the output showing just 3 of those 4 116 Acuras:

$ awk -f tst.awk file
Ranking ID      Year    Make    Model   Total
4       113     2012    Acura   Tsx sportwagon  116     1
4       113     2012    Acura   Foo     116     2
4       113     2012    Acura   Bar     116     3
3       15      014     Audi    S4      120     1
11      216     2007    Chrystler       300     96      1
5       83      2014    Honda   Accord  112     1
7       65      2009    Honda   Fit     106     2
10      31      2007    Honda   Fit     102     3
1       128     2010    Infiniti        G37     128     1
2       124     2015    Jeep    Wrangler        124     1
8       91      2010    Mitsu   Lancer  102     1
12      126     2010    Volkswagen      Eos     92      1

If that's not what you want then move the if ( ++cnt <= 3 ) test to the outer loop or handle it however else you want.

回复收藏 0 原文

~没有更多了~