Sorting multiple columns
I have some data in the following format:
1298501934.311 42.048
1298501934.311 60.096
1298501934.311 64.128
1298501934.311 64.839
1298501944.203 28.352
1298501966.283 6.144
1298501972.900 0
1298501972.939 0
1298501972.943 0
1298501972.960 0
1298501972.961 0
1298501972.964 0
1298501973.964 28.636
1298501974.215 27.52
1298501974.407 25.984
1298501974.527 27.072
1298501974.527 31.168
1298501974.591 30.144
1298501974.591 31.296
1298501974.83 27.605
1298501975.804 28.096
1298501976.271 23.879
1298501978.488 25.472
1298501978.744 25.088
1298501978.808 25.088
1298501978.936 26.24
1298501979.123 26.048
1298501980.470 23.75
1298501980.86 17.53
1298501982.392 22.336
1298501990.199 8.064
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44
My goal is to get the maximum value from the right column for each unique value in the left column. For instance, after processing the following 4 lines:
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
I would like to get just the last line,
1298501997.943 5.952
since "5.952" is the largest value for 1298501997.943
Similarly, for the following lines:
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44
I would like to get:
1298501997.946 5.44
And for:
1298501990.199 8.064
simply:
1298501990.199 8.064
and so on...
I tried searching for hints on awk, uniq, etc., but I am not even sure how to formulate the query.
I could write a Python script, but it feels like awk or some other standard tool would be more efficient (especially since I have a lot of data: millions or tens of millions of lines).
PS: Is there any Python module for text processing scenarios like that?
Thank you
Comments (5)
You could put it in Excel (importing it by splitting on the SPACE character) and sort it that way. This is a rather brute-force solution, but it's simple.
Use awk:
Output:
A simple sort -g does the trick. It is a general numeric sort and can handle spaces.
I doubt Python would be significantly less efficient here than other tools (unless you need to process millions of records every fraction of a second). You can do something like this:
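A minimal sketch of such a script, offered as an illustration and not necessarily the code originally posted. It assumes two whitespace-separated columns per line and treats the left column as a string key, so 1298501974.83 and 1298501974.830 would count as different keys. The function name max_per_key and the command-line handling are hypothetical.

```python
import sys


def max_per_key(lines):
    """Map each left-column key to (largest value as float, its original text)."""
    best = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, text = parts
        value = float(text)
        # Keep the row only if it beats the best value seen so far for this key.
        if key not in best or value > best[key][0]:
            best[key] = (value, text)
    return best


# Assumed invocation: the data file is passed as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        result = max_per_key(handle)
    # Dicts keep insertion order (Python 3.7+), so keys print in first-seen order.
    for key, (_, text) in result.items():
        print(key, text)
```

The dictionary holds one entry per distinct key, so memory use grows with the number of unique left-column values, not with the number of lines.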
and run it with
As a shell one-liner (uses the -f argument of uniq, which ignores the first n columns; to ignore the second, the columns are swapped twice):
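Because uniq only compares adjacent lines, this approach depends on rows with the same key sitting next to each other. The same idea can be sketched in Python with itertools.groupby, which likewise groups only adjacent items and streams through the input without holding the whole file in memory. This is a sketch under that adjacency assumption (true for the sample data, or for input sorted on the first column); the function name max_of_adjacent_groups is hypothetical.

```python
from itertools import groupby


def max_of_adjacent_groups(lines):
    """Yield (key, value_text) with the largest value for each run of equal keys.

    Assumes rows sharing a key are adjacent; a key that reappears later
    in the input would start a new group and be reported again.
    """
    rows = (line.split() for line in lines)
    rows = (row for row in rows if len(row) == 2)  # skip blank or malformed lines
    for key, group in groupby(rows, key=lambda row: row[0]):
        best = max(group, key=lambda row: float(row[1]))
        yield key, best[1]


# Small example using rows from the question.
sample = [
    "1298501997.943 0.256",
    "1298501997.943 5.952",
    "1298501997.946 0.448",
    "1298501997.946 5.44",
]
for key, value in max_of_adjacent_groups(sample):
    print(key, value)
# prints:
# 1298501997.943 5.952
# 1298501997.946 5.44
```

For real data, passing an open file object instead of the sample list keeps the processing line by line.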