How do I sort a CSV file on multiple columns using Python?

Posted on 2024-12-14 06:24:13

I am having issues figuring out how to sort a large data set into more useful data.

The original file, in CSV format, looks like the snippet below. Each row gives an x, y, z position followed by an energy. The x, y, z coordinates span a wide range, so this is only a small sample. Essentially, it was an energy search over a volume.

-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026

Unfortunately, it's very difficult to plot in the requisite four dimensions, so I need to trim this data. I would like to turn the volume into a surface by keeping only the lowest-energy point along the z axis for each (x, y). On smaller data sets this was simple: in Excel, sort by X, then Y, then energy, and delete all energies above the lowest. That was easy enough for small data sets but has quickly become problematic.

I have tried various ways of doing this such as splitting the csv and using the sort command, but I am having little luck. Any advice on how to approach this would be much appreciated.

扛刀软妹 2024-12-21 06:24:13

This does what you asked for in your comment on Raymond's answer: it returns just the row with the lowest z for each (x, y) pair.

from operator import itemgetter
from itertools import groupby
from csv import reader


def min_z(iterable):
    # convert the data from strings to numbers
    floats = [[float(n) for n in row] for row in iterable]
    # sort the data by x, y, z (the first three columns)
    floats.sort(key=lambda row: row[:3])
    # group the data by x, y
    grouped_floats = groupby(floats, key=itemgetter(slice(0, 2)))
    # return the first item from each group:
    # because the data is sorted,
    # the first item has the smallest z for the x, y group
    return [next(rowgroup) for xy, rowgroup in grouped_floats]


data = """-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026""".splitlines()


print(min_z(reader(data)))

Prints:

[[-3.0, 1.0, 4.07, -0.81185718], 
 [-2.8, -1.0, 3.27, 1.29303723], 
 [-2.8, -0.4, 4.87, -0.51165026]]
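
The question itself asks for the lowest energy rather than the lowest z. As a rough sketch of that variant (not part of the original answer), the same sort-then-group idea can sort on x, y and energy together and write the trimmed rows back out as CSV. The function name `lowest_energy_rows` and the in-memory output buffer are assumptions made for illustration only.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def lowest_energy_rows(lines):
    """Return one [x, y, z, energy] row per (x, y): the lowest-energy one."""
    # Hypothetical helper; skips empty rows and converts every field to float.
    rows = [[float(value) for value in row] for row in csv.reader(lines) if row]
    # Sort on several columns at once: x, then y, then energy (columns 0, 1, 3).
    rows.sort(key=itemgetter(0, 1, 3))
    # After sorting, the first row of each (x, y) group has the lowest energy.
    return [next(group) for _, group in groupby(rows, key=itemgetter(0, 1))]


sample = """-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026""".splitlines()

surface = lowest_energy_rows(sample)

# Write the trimmed rows as CSV. An in-memory buffer is used here; a file
# opened with open(path, "w", newline="") can be passed to csv.writer instead.
out = io.StringIO()
csv.writer(out).writerows(surface)
print(out.getvalue())
```

Passing a tuple of column indexes to `itemgetter` is what makes this a multi-column sort: rows are compared on x first, then y, then energy.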

猫烠⑼条掵仅有一顆心 2024-12-21 06:24:13

After reading the data into a list of tuples with the csv.reader, sort the data by (x, y) values. For clarity, use named tuples to identify the fields.

Then use itertools.groupby to cluster the related (x, y) data points. For each group, use min to isolate the one with the lowest energy:

>>> import csv, collections, itertools

>>> raw_data = '''\
-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026
'''.splitlines()

>>> Sample = collections.namedtuple('Sample', ['x', 'y', 'z', 'energy'])
>>> # convert each field to float so that sorting and min() compare numbers, not strings
>>> data = [Sample(*map(float, row)) for row in csv.reader(raw_data)]
>>> data.sort(key=lambda s: (s.x, s.y))
>>> for xy, group in itertools.groupby(data, key=lambda s: (s.x, s.y)):
...     print(min(group, key=lambda s: s.energy))
...
Sample(x=-3.0, y=1.0, z=4.07, energy=-0.81185718)
Sample(x=-2.8, y=-1.0, z=5.47, energy=-0.26488315)
Sample(x=-2.8, y=-0.4, z=4.87, energy=-0.51165026)
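
For a very large file, sorting everything first may be more work than necessary. One possible alternative, sketched here under the assumption that repeated (x, y) values are written identically in the file so that they parse to equal floats, is a single pass that remembers the best sample seen so far for each (x, y) in a dictionary. The name `lowest_energy_by_xy` is made up for this example.

```python
import csv


def lowest_energy_by_xy(lines):
    """Map each (x, y) to the (z, energy) of its lowest-energy sample."""
    best = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        x, y, z, energy = (float(value) for value in row)
        # Keep this sample only if (x, y) is new or its energy is lower.
        if (x, y) not in best or energy < best[(x, y)][1]:
            best[(x, y)] = (z, energy)
    return best


sample = """-2.800000,-1.000000,5.470000,-0.26488315
-3.000000,1.000000,4.070000,-0.81185718
-2.800000,-1.000000,3.270000,1.29303723
-2.800000,-0.400000,4.870000,-0.51165026""".splitlines()

best = lowest_energy_by_xy(sample)
# Sort only the reduced result, which has one entry per (x, y).
for (x, y), (z, energy) in sorted(best.items()):
    print(x, y, z, energy)
```

Because `csv.reader` accepts any iterable of lines, an open file object can be passed in place of `sample`, so only one entry per (x, y) pair is held in memory rather than the whole data set.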
