A definitive way to parse alphanumeric CSVs in Python with scipy/numpy

Posted on 2024-09-04 13:30:34

I've been trying to find a good and flexible way to parse CSV files in Python but none of the standard options seem to fit the bill. I am tempted to write my own but I think that some combination of what exists in numpy/scipy and the csv module can do what I need, and so I don't want to reinvent the wheel.

I'd like the standard features of being able to specify delimiters, specify whether or not there's a header, how many rows to skip, the comment delimiter, which columns to ignore, etc. The central feature I am missing is being able to parse CSV files in a way that gracefully handles both string data and numeric data. Many of my CSV files have columns that contain strings (not necessarily of the same length) and numeric data. I'd like to be able to have numpy array functionality for this numeric data, but also be able to access the strings. For example, suppose my file looks like this (imagine columns are tab-separated):

# my file
name  favorite_integer  favorite_float1  favorite_float2  short_description
johnny  5  60.2  0.52  johnny likes fruitflies
bob 1  17.52  0.001  bob, bobby, robert

data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')

I'd like to be able to access data in two ways:

  1. As a matrix of values: it's important for me to get a numpy.array so that I can easily transpose and access the columns that are numeric. In this case, I want to be able to do something like:

    floats_and_ints = data.matrix

    floats_and_ints[:, 0] # access the integers

    floats_and_ints[:, 1:3] # access some of the floats
    transpose(floats_and_ints) # etc..

  2. As a dictionary-like object where I don't have to know the order of the headers: I'd like to also access the data by header name. For example, I'd like to do:

    data['favorite_float1'] # get all the values of the column with header "favorite_float1"

    data['name'] # get all the names of the rows

I don't want to have to know that favorite_float1 is the third column in the table, since this might change.

It's also important for me to be able to iterate through the rows and access the fields by name. For example:

for row in data:
  # print names and favorite integers of all rows
  print("Name:", row["name"], row["favorite_integer"])

The representation in (1) suggests a numpy.array, but as far as I can tell, this does not deal well with strings and requires me to specify the data type ahead of time as well as the header labels.

The representation in (2) suggests a list of dictionaries, and this is what I have been using. However, this works poorly for CSV files that have two string fields while the rest of the columns are numeric. For the numeric values, you really do want to be able to sometimes get access to the matrix representation and manipulate it as a numpy.array.

Is there a combination of csv/numpy/scipy features that allows the flexibility of both worlds? Any advice on this would be greatly appreciated.

In summary, the main features are:

  1. Standard ability to specify delimiters, number of rows to skip, columns to ignore, etc.
  2. The ability to get a numpy.array/matrix representation of the data so that numeric values can be manipulated
  3. The ability to extract columns and rows by header name (as in the above example)

Comments (4)

人间☆小暴躁 2024-09-11 13:30:34

Have a look at pandas, which is built on top of numpy.
Here is a small example:

In [7]: df = pd.read_csv('data.csv', sep='\t', index_col='name')
In [8]: df
Out[8]: 
        favorite_integer  favorite_float1  favorite_float2        short_description
name                                                                               
johnny                 5            60.20            0.520  johnny likes fruitflies
bob                    1            17.52            0.001       bob, bobby, robert
In [9]: df.describe()
Out[9]: 
       favorite_integer  favorite_float1  favorite_float2
count          2.000000         2.000000         2.000000
mean           3.000000        38.860000         0.260500
std            2.828427        30.179317         0.366988
min            1.000000        17.520000         0.001000
25%            2.000000        28.190000         0.130750
50%            3.000000        38.860000         0.260500
75%            4.000000        49.530000         0.390250
max            5.000000        60.200000         0.520000
In [13]: df.loc['johnny', 'favorite_integer'] # .ix was removed in pandas 1.0; use .loc for labels
Out[13]: 5
In [15]: df['favorite_float1'] # or attribute: df.favorite_float1
Out[15]: 
name
johnny    60.20
bob       17.52
Name: favorite_float1
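In [16]: df['mean_favorite'] = df.mean(axis=1, numeric_only=True) # skip the string column
In [17]: df.iloc[:, 3:] # .ix was removed in pandas 1.0; use .iloc for positions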
Out[17]: 
              short_description  mean_favorite
name                                          
johnny  johnny likes fruitflies      21.906667
bob          bob, bobby, robert       6.173667
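
The session above can be tied back to the two access patterns in the question. The following is only a minimal sketch, assuming a current pandas release; the inline `RAW` string stands in for the question's tab-separated file, and the variable names are illustrative rather than part of the original answer.

```python
import io

import pandas as pd

# Sample data from the question, tab-separated, with a leading comment line.
# RAW is a stand-in for the file on disk (an assumption for this sketch).
RAW = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

# comment="#" drops the "# my file" line; the next line becomes the header.
df = pd.read_csv(io.StringIO(RAW), sep="\t", comment="#")

# (1) Matrix view: keep only the numeric columns and convert to a NumPy array.
floats_and_ints = df.select_dtypes(include="number").to_numpy()
integers = floats_and_ints[:, 0]
some_floats = floats_and_ints[:, 1:3]

# (2) Column access by header name, independent of column order.
names = df["name"].tolist()
float1 = df["favorite_float1"].to_numpy()

# Row iteration with access to fields by name.
for _, row in df.iterrows():
    print("Name:", row["name"], row["favorite_integer"])
```

Because the numeric columns mix integers and floats, the resulting array is upcast to a single floating-point dtype, which is usually what you want for matrix operations.
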
白况 2024-09-11 13:30:34

matplotlib.mlab.csv2rec returns a numpy recarray, so you can do all the great numpy things to this that you would do with any numpy array. The individual rows, being record instances, can be indexed as tuples but also have attributes automatically named for the columns in your data:

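import matplotlib.mlab  # note: csv2rec was removed in Matplotlib 3.1, so this needs an older release

rows = matplotlib.mlab.csv2rec('data.csv')
row = rows[0]

print(row[0])
print(row.name)
print(row['name'])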

csv2rec also understands "quoted strings", unlike numpy.genfromtxt.

In general, I find that csv2rec combines some of the best features of csv.reader and numpy.genfromtxt.

标点 2024-09-11 13:30:34

numpy.genfromtxt()
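
The answer gives only the function name, so here is a hedged sketch of how `numpy.genfromtxt` might be applied to the question's file. The inline `RAW` string and the variable names are assumptions made for illustration. `skip_header=1` is used to step over the `# my file` line so that the following line is read as the header.

```python
import io

import numpy as np

# Sample data from the question; RAW stands in for the real file (assumption).
RAW = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

data = np.genfromtxt(
    io.StringIO(RAW),
    delimiter="\t",
    skip_header=1,     # skip the "# my file" line so the next line is the header
    names=True,        # take field names from the header line
    dtype=None,        # infer a type for each column
    encoding="utf-8",  # return str rather than bytes for text columns
)

# Structured array: columns are available by header name.
names = data["name"]
float1 = data["favorite_float1"]

# Build a plain 2-D matrix from the numeric fields only.
numeric_names = [
    n for n in data.dtype.names if np.issubdtype(data.dtype[n], np.number)
]
matrix = np.column_stack([data[n] for n in numeric_names])

# Rows can be iterated and indexed by field name.
for row in data:
    print("Name:", row["name"], row["favorite_integer"])
```

The result is a structured array, so named access and row iteration come for free, while the numeric matrix has to be assembled explicitly from the numeric fields.
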

落叶缤纷 2024-09-11 13:30:34

Why not just use the stdlib csv.DictReader?
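
For comparison, a rough sketch of the `csv.DictReader` route follows, under the assumption that comment lines are filtered out by hand (the reader has no comment option) and that the list of numeric columns is known in advance. `RAW` and `numeric_columns` are illustrative names, not part of the original answer.

```python
import csv
import io

import numpy as np

# Sample data from the question; RAW stands in for the real file (assumption).
RAW = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

# Drop comment lines ourselves before handing the lines to DictReader.
lines = (line for line in io.StringIO(RAW) if not line.startswith("#"))
rows = list(csv.DictReader(lines, delimiter="\t"))

# Every value arrives as a string, so numeric columns are converted by hand.
numeric_columns = ["favorite_integer", "favorite_float1", "favorite_float2"]
matrix = np.array([[float(row[c]) for c in numeric_columns] for row in rows])

# Rows are dictionaries keyed by header name.
for row in rows:
    print("Name:", row["name"], row["favorite_integer"])
```

This gives the list-of-dictionaries representation the question already describes; the manual conversion step is the price paid for staying within the standard library.
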
