Speeding up file parsing
The following function parses a CSV file into a list of dictionaries. Each element of the list is a dictionary whose values are keyed by the file's header fields (the header is assumed to be the first line).
This function is very slow: it takes about 6 seconds for a relatively small file (fewer than 30,000 lines).
How can I speed it up?
def csv2dictlist_raw(filename, delimiter='\t'):
    # tryEval is a helper defined elsewhere (not shown here); presumably it
    # tries to turn each string into a number
    with open(filename) as f:
        header_line = f.readline().strip()
        header_fields = header_line.split(delimiter)
        dictlist = []
        # convert data to list of dictionaries
        for line in f:
            values = map(tryEval, line.strip().split(delimiter))
            dictline = dict(zip(header_fields, values))
            dictlist.append(dictline)
    return (dictlist, header_fields)
In response to comments:
I know there's a csv module, and I can use it like this:
data = csv.DictReader(my_csvfile, delimiter=delimiter)
This is much faster. However, it doesn't automatically cast values that are obviously floats or integers to numeric types; it leaves them as strings. How can I fix this?
Using the "Sniffer" class does not work for me. When I try it on my files, I get this error:
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/csv.py", line 180, in sniff
    raise Error, "Could not determine delimiter"
Error: Could not determine delimiter
How can I make DictReader parse the fields into their types when it's obvious?
Thanks.
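One possible way to get the behavior asked for above is to keep csv.DictReader and convert each value after it has been read. The following is only a minimal sketch written for Python 3 (the traceback above comes from Python 2.6). The names to_number and read_typed_rows and the sample data are made up for illustration, and to_number is not the tryEval helper from the question, which is not shown.

```python
import csv
import io


def to_number(text):
    """Return an int or float if text parses as one, else return it unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except (ValueError, TypeError):
            # ValueError: not numeric; TypeError: None from a short row
            pass
    return text


def read_typed_rows(fileobj, delimiter="\t"):
    """Read delimited text into a list of dicts, converting numeric-looking values."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    rows = [{key: to_number(value) for key, value in row.items()} for row in reader]
    return rows, reader.fieldnames


# Hypothetical sample data standing in for the real file
sample = io.StringIO("name\tcount\tscore\nabc\t1\t2.5\nxyz\t7\tn/a\n")
rows, header = read_typed_rows(sample)
print(rows[0])
```

Note that float() also accepts strings such as "nan" and "inf", so a text column containing those would be converted too. Whether this is faster than the original function depends on what tryEval does, so it is worth timing on the real file.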
Comments (3)
I see several issues with your code:
Why do you need dicts? The keys are stored in each dict instance, which blows up memory consumption.
Do you really need to hold all instances in memory, or would it be an option to use yield?
Trying to convert each value takes time and makes no sense in my opinion. If a column has the values "abc" and "123", the last value should probably be a string. So the type of a column should be fixed, and you should make the conversion explicit.
Even if you want to use your conversion logic: use the csv module and convert the values afterwards.
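The suggestions above (yield rows instead of building a list, fix the type of each column, convert explicitly after the csv module has done the parsing) could be combined roughly as follows. This is a sketch for Python 3 under stated assumptions: the column names, the COLUMN_TYPES mapping, and the function name iter_typed_rows are hypothetical, and every row is assumed to have exactly the columns named in the header.

```python
import csv
import io

# Hypothetical column names and types; adapt these to the real file.
COLUMN_TYPES = {"name": str, "count": int, "score": float}


def iter_typed_rows(fileobj, column_types, delimiter="\t"):
    """Yield one dict per row, converting each column with its declared type."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    for row in reader:
        yield {key: column_types[key](value) for key, value in row.items()}


# Hypothetical sample data; note the "123" in the string column
sample = io.StringIO("name\tcount\tscore\nabc\t1\t2.5\n123\t7\t0.5\n")
typed_rows = list(iter_typed_rows(sample, COLUMN_TYPES))
print(typed_rows[1])
```

Because the type is declared per column, the "123" in the name column stays a string, which is the case described above. Iterating over iter_typed_rows directly instead of wrapping it in list() keeps only one row in memory at a time.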
What about pandas?
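For reference, a pandas version might look like the sketch below. It assumes pandas is installed and that the file is tab-delimited like in the question; the sample data is made up. read_csv infers a numeric dtype per column, and to_dict("records") gives back a list of dictionaries similar to what the original function returns.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real file
sample = io.StringIO("name\tcount\tscore\nabc\t1\t2.5\nxyz\t7\t0.5\n")

df = pd.read_csv(sample, sep="\t")   # column types are inferred per column
records = df.to_dict("records")      # list of dicts, one per row
header = list(df.columns)
print(df.dtypes)
```

With a real file, a path can be passed to read_csv in place of the StringIO object.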