The ultimate way to parse alphanumeric CSV files in Python with scipy/numpy
I've been trying to find a good and flexible way to parse CSV files in Python but none of the standard options seem to fit the bill. I am tempted to write my own but I think that some combination of what exists in numpy/scipy and the csv module can do what I need, and so I don't want to reinvent the wheel.
I'd like the standard features: being able to specify the delimiter, whether or not there's a header, how many rows to skip, the comment delimiter, which columns to ignore, etc. The central feature I am missing is being able to parse CSV files in a way that gracefully handles both string data and numeric data. Many of my CSV files have columns that contain strings (not necessarily of the same length) and numeric data. I'd like to have numpy array functionality for the numeric data, but also be able to access the strings. For example, suppose my file looks like this (imagine columns are tab-separated):
# my file
name favorite_integer favorite_float1 favorite_float2 short_description
johnny 5 60.2 0.52 johnny likes fruitflies
bob 1 17.52 0.001 bob, bobby, robert
data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')
I'd like to be able to access data in two ways:
(1) As a matrix of values: it's important for me to get a numpy.array so that I can easily transpose and access the columns that are numeric. In this case, I want to be able to do something like:
floats_and_ints = data.matrix
floats_and_ints[:, 0] # access the integers
floats_and_ints[:, 1:3] # access some of the floats
transpose(floats_and_ints) # etc..
(2) As a dictionary-like object where I don't have to know the order of the headers: I'd also like to access the data by header name. For example, I'd like to do:
data['favorite_float1'] # get all the values of the column with header "favorite_float1"
data['name'] # get all the names of the rows
I don't want to have to know that favorite_float1 is the second column in the table, since this might change.
It's also important for me to be able to iterate through the rows and access the fields by name. For example:
for row in data:
    # print names and favorite integers of all
    print("Name: ", row["name"], row["favorite_integer"])
The representation in (1) suggests a numpy.array, but as far as I can tell, this does not deal well with strings and requires me to specify the data type ahead of time as well as the header labels.
The representation in (2) suggests a list of dictionaries, and this is what I have been using. However, this is really bad for CSV files that have two string fields while the rest of the columns are numeric. For the numeric values, you really do want to be able to sometimes get access to the matrix representation and manipulate it as a numpy.array.
Is there a combination of csv/numpy/scipy features that allows the flexibility of both worlds? Any advice on this would be greatly appreciated.
In summary, the main features are:
- Standard ability to specify delimiters, number of rows to skip, columns to ignore, etc.
- The ability to get a numpy.array/matrix representation of the data so that the numeric values can be manipulated
- The ability to extract columns and rows by header name (as in the above example)
Comments (4)
Have a look at pandas, which is built on top of numpy. Here is a small example:
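The example from this answer is not preserved here. As a minimal sketch of the idea, assuming the tab-separated file from the question (inlined below as a string so the snippet is self-contained; the variable names are illustrative), pandas.read_csv could be used roughly like this:

```python
import io

import pandas as pd

# Stand-in for the tab-separated file from the question (assumed contents).
raw = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

# sep sets the delimiter; comment="#" makes the parser skip the "# my file" line.
df = pd.read_csv(io.StringIO(raw), sep="\t", comment="#")

# Column access by header name, independent of column order.
names = df["name"]
favorite_float1 = df["favorite_float1"]

# Numeric columns only, as a plain numpy array that can be sliced or transposed.
numeric = df.select_dtypes(include="number").to_numpy()
integers = numeric[:, 0]
transposed = numeric.T

# Row iteration with access to fields by name.
for _, row in df.iterrows():
    print("Name:", row["name"], row["favorite_integer"])
```

Because the integer and float columns end up in one array, numpy stores them all as floats in `numeric`; the string columns stay available through the DataFrame.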
matplotlib.mlab.csv2rec returns a numpy recarray, so you can do all the great numpy things to it that you would do with any numpy array. The individual rows, being record instances, can be indexed as tuples but also have attributes automatically named for the columns in your data.
csv2rec also understands "quoted strings", unlike numpy.genfromtxt.
In general, I find that csv2rec combines some of the best features of csv.reader and numpy.genfromtxt.
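To illustrate the recarray behaviour described in this answer without depending on csv2rec itself, here is a small sketch that builds a recarray directly with numpy. The rows are typed in by hand and are only an assumption standing in for parsed CSV data:

```python
import numpy as np

# Hand-written rows standing in for parsed CSV data (illustrative only).
rows = [("johnny", 5, 60.2), ("bob", 1, 17.52)]
rec = np.rec.fromrecords(
    rows, names=["name", "favorite_integer", "favorite_float1"]
)

# A single row is a record: index it like a tuple...
first = rec[0]
first_integer_by_position = first[1]

# ...or read a field through the attribute named after its column.
first_integer_by_name = first.favorite_integer

# Whole columns are available as attributes too, and behave like numpy arrays.
float_column = rec.favorite_float1
float_mean = float_column.mean()
```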
numpy.genfromtxt()
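A minimal sketch of how numpy.genfromtxt might be applied to the file from the question, with the file contents inlined as a string (an assumption made so the snippet runs on its own). Here skip_header=1 skips the "# my file" line so that the header row is read as the field names, and dtype=None asks numpy to infer a type per column:

```python
import io

import numpy as np

# Stand-in for the tab-separated file from the question (assumed contents).
raw = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

data = np.genfromtxt(
    io.StringIO(raw),
    delimiter="\t",
    skip_header=1,     # skip the "# my file" comment line
    names=True,        # take field names from the next line
    dtype=None,        # infer a dtype for each column
    encoding="utf-8",  # read strings as str rather than bytes
)

# Column access by header name on the resulting structured array.
names = data["name"]
favorite_float1 = data["favorite_float1"]

# Stack the numeric fields into an ordinary 2-D array for matrix-style work.
numeric_fields = [
    field for field in data.dtype.names
    if np.issubdtype(data.dtype[field], np.number)
]
matrix = np.column_stack([data[field] for field in numeric_fields])

# Rows can be iterated and their fields read by name.
for row in data:
    print("Name:", row["name"], row["favorite_integer"])
```

The structured array keeps strings and numbers side by side, while `matrix` holds only the numeric columns and supports slicing and transposition.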
Why not just use the stdlib csv.DictReader?
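As a rough sketch of what this suggestion could look like for the file from the question: csv.DictReader has no comment option, so the hypothetical helper `load_rows` below filters out lines starting with "#" before handing them to the reader. The file contents are inlined as a string so the snippet is self-contained.

```python
import csv
import io

# Stand-in for the tab-separated file from the question (assumed contents).
raw = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)


def load_rows(lines, delimiter="\t", comment="#"):
    """Hypothetical helper: return a list of dicts, skipping comment lines."""
    kept = (line for line in lines if not line.startswith(comment))
    return list(csv.DictReader(kept, delimiter=delimiter))


rows = load_rows(io.StringIO(raw))

# Each row is a dict keyed by header name; every value is still a string.
for row in rows:
    print("Name:", row["name"], row["favorite_integer"])

# Numeric columns have to be converted explicitly.
favorite_floats = [float(row["favorite_float1"]) for row in rows]
```

This gives access by header name, but the values stay strings, so the matrix-style access asked for in the question would still need a conversion step.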