Multi-column sorting

Posted on 2024-12-04 13:22:56

I have some data in the following format:

1298501934.311 42.048
1298501934.311 60.096
1298501934.311 64.128
1298501934.311 64.839
1298501944.203 28.352
1298501966.283 6.144
1298501972.900 0
1298501972.939 0
1298501972.943 0
1298501972.960 0
1298501972.961 0
1298501972.964 0
1298501973.964 28.636
1298501974.215 27.52
1298501974.407 25.984
1298501974.527 27.072
1298501974.527 31.168
1298501974.591 30.144
1298501974.591 31.296
1298501974.83 27.605
1298501975.804 28.096
1298501976.271 23.879
1298501978.488 25.472
1298501978.744 25.088
1298501978.808 25.088
1298501978.936 26.24
1298501979.123 26.048
1298501980.470 23.75
1298501980.86 17.53
1298501982.392 22.336
1298501990.199 8.064
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44

My goal is to get the maximum value from the right column for each unique value in the left column. For instance, after processing the following 4 lines:

1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952

I would like to get just the last line,

1298501997.943 5.952

since "5.952" is the largest value for 1298501997.943

Similarly, for the following lines:

1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44

I would like to get:

1298501997.946 5.44

And for:

1298501990.199 8.064

simply:

1298501990.199 8.064

and so on...

I tried searching for hints on awk, uniq and similar tools, but I am not even sure how to formulate the query.
I could write a Python script, but it feels as if awk or some other standard tool would be more efficient (especially since I have a lot of data - millions or tens of millions of lines).

PS: Is there any Python module for text processing scenarios like that?

Thank you

烟火散人牵绊 2024-12-11 13:22:56

You could put the data into Excel (import it by splitting on the space character) and sort it that way. It is a fairly brute-force solution, but a simple one.

一个人的旅程 2024-12-11 13:22:56

Use awk:

{
    # Store the value when the key is new or the value is larger (numeric compare).
    if (!($1 in array) || $2 + 0 > array[$1] + 0)
        array[$1] = $2
}
END {
    printf("%-20s%s\n", "Value", "Max")
    printf("%-20s%s\n", "-----", "---")
    # Note: "for (i in array)" visits the keys in an unspecified order.
    for (i in array)
        printf("%-20s%s\n", i, array[i])
}

Output:

$ awk -f sort.awk log 
Value               Max
-----               ---
1298501980.86       17.53
1298501978.808      25.088
1298501974.215      27.52
1298501973.964      28.636
1298501979.123      26.048
1298501978.936      26.24
1298501975.804      28.096
1298501972.964      0
1298501944.203      28.352
1298501974.83       27.605
1298501974.407      25.984
1298501997.943      5.952    <---- as in your example
1298501978.488      25.472
1298501972.939      0
1298501972.900      0
1298501982.392      22.336
1298501974.527      31.168
1298501997.946      5.44     <---- as in your example
1298501980.470      23.75
1298501974.591      31.296
1298501990.199      8.064    <---- as in your example
1298501966.283      6.144
1298501934.311      64.839
1298501976.271      23.879
1298501972.960      0
1298501978.744      25.088
1298501972.961      0
1298501972.943      0

纸伞微斜 2024-12-11 13:22:56

A simple sort -g would do the trick. It is a general numeric sort and can handle spaces.
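Sorting by itself only orders the rows; something still has to pick one row per left-column value. The sort-then-pick idea could be sketched in Python as follows. The function name `last_per_key_after_sort` is hypothetical, and the sketch assumes every line holds exactly two whitespace-separated numeric fields and that the whole input fits in memory.

```python
def last_per_key_after_sort(lines):
    """Sort rows numerically by (key, value), then keep the last row of each key."""
    rows = []
    for line in lines:
        key, value = line.split()
        # Keep the original text so the output preserves the input formatting.
        rows.append((float(key), float(value), key, value))

    # Numeric sort on both columns; the largest value of a key ends up last.
    rows.sort(key=lambda r: (r[0], r[1]))

    result = []
    for i, row in enumerate(rows):
        is_last_of_key = i + 1 == len(rows) or rows[i + 1][0] != row[0]
        if is_last_of_key:
            result.append((row[2], row[3]))
    return result
```

After an ascending sort on both columns, the maximum for each key is simply the final row of that key's run, which is why a single linear pass is enough.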

任性一次 2024-12-11 13:22:56

I doubt Python would be significantly less efficient here than other tools (unless you need to process millions of records every fraction of a second). You can do something like this:

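import sys

# Map each left-column value to the largest right-column value seen so far.
d = {}
with open(sys.argv[1]) as f:
    for l in f:
        a, b = [float(item) for item in l.split()]
        d[a] = max(d.get(a, b), b)

for a in d:
    print(a, d[a])

and run it with

$ python3 script.py dataFile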
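On the question of a Python module for this kind of processing: the standard library's `itertools.groupby` is one option when rows with the same left-column value are adjacent, as they appear to be in the sample data. The sketch below makes that assumption (if the file is not grouped, a key can be reported more than once), compares keys as text, and uses the hypothetical name `max_per_group`. It keeps only one group in memory at a time, which may matter with tens of millions of lines.

```python
from itertools import groupby


def max_per_group(lines):
    """Yield (key, value) with the largest value for each run of equal keys.

    Assumes lines with the same key are adjacent and that each non-blank
    line holds exactly two whitespace-separated fields.
    """
    pairs = (line.split() for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda p: p[0]):
        # Compare values numerically, but return the original text.
        best = max(group, key=lambda p: float(p[1]))
        yield key, best[1]
```

Passing an open file object streams the file line by line, for example `for key, value in max_per_group(open(path)): print(key, value)`, where `path` is whatever file name you use.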

老旧海报 2024-12-11 13:22:56

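As a shell one-liner. It uses uniq's -f option, which skips the first n fields when comparing lines; to make uniq ignore the value column, the columns are swapped before uniq and swapped back afterwards. Because uniq keeps the first line of each run of matching lines, the values are sorted in descending order within each key so that the maximum comes first:

sort -k1,1g -k2,2gr yourData | awk '{print $2,$1}' | uniq -f1 | awk '{print $2,$1}'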
