Writing large CSV files: the dict-based CSV writer seems to be the problem
I have a big bag-of-words array (words and their counts) that I need to write to a large flat CSV file.

In testing with around 1000 or so words, this works just fine. I use the `DictWriter` as follows:

```python
self.csv_out = csv.DictWriter(open(self.loc + '.csv', 'w'), quoting=csv.QUOTE_ALL, fieldnames=fields)
```

where `fields` is the list of words (i.e. the keys of the dictionary that I pass to `csv_out.writerow`).

However, this seems to scale horribly: as the number of words increases, the time required to write a row grows exponentially. The `dict_to_list` method in the `csv` module seems to be the instigator of my troubles.

I'm not entirely sure how to even begin to optimize here. Are there any faster CSV routines I could use?
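For reference, a minimal self-contained version of the setup described above (the word list and counts are made-up stand-ins for the real data, and the output goes to an in-memory buffer instead of a file):

```python
import csv
import io

# Stand-ins for the real data: each word is one column in the output.
fields = ["apple", "banana", "cherry"]
row = {"apple": 3, "banana": 5, "cherry": 2}

buf = io.StringIO()
csv_out = csv.DictWriter(buf, quoting=csv.QUOTE_ALL, fieldnames=fields)
csv_out.writeheader()
csv_out.writerow(row)  # DictWriter maps dict keys to column positions on every call
```

With thousands of words, `fields` becomes thousands of columns, and the per-row key-to-column mapping is where the time goes.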
2 Answers
OK, this is by no means the answer, but I looked up the source code for the `csv` module and noticed that there is a very expensive `if not` check in the module (lines 136-141 in Python 2.6). So a quick workaround seems to be to pass `extrasaction="ignore"` when creating the writer. This seems to speed things up very substantially. Not a perfect solution, and perhaps somewhat obvious, but I'm posting it in case it is helpful to somebody else.
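As a sketch (data made up), the workaround is just one extra keyword argument. In the old Python 2.6 implementation, the default `extrasaction="raise"` checked every key of every row against the fieldnames list before writing, and membership testing against a list of N fieldnames is O(N) per key, so O(N²) per row:

```python
import csv
import io

fields = ["apple", "banana", "cherry"]  # thousands of words in the real case
row = {"apple": 3, "banana": 5, "cherry": 2}

buf = io.StringIO()
# extrasaction="ignore" skips the per-row "is every key a known fieldname?"
# check (extra keys are silently dropped instead of raising ValueError).
csv_out = csv.DictWriter(buf, quoting=csv.QUOTE_ALL, fieldnames=fields,
                         extrasaction="ignore")
csv_out.writerow(row)
```

Note the trade-off: with `"ignore"`, a misspelled key is dropped silently instead of raising an error.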
The obvious optimisation is to use a `csv.writer` instead of a `DictWriter`, passing in iterables for each row instead of dictionaries. Does that not help?

When you say "the number of words", do you mean the number of columns in the CSV? Because I've never seen a CSV that needs thousands of columns! Maybe you have transposed your data and are writing columns instead of rows? Each row should represent one datum, with its parts defined by the columns. If you really do need that sort of size, maybe a database is a better choice?
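A sketch of that optimisation (word names and counts are made up): fix the column order once, then build each row as a plain list in that order, so the writer never has to map dictionary keys to column positions:

```python
import csv
import io

fields = ["apple", "banana", "cherry"]  # column order, fixed up front
counts = {"apple": 3, "banana": 5, "cherry": 2}

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(fields)  # header row
# One list in field order per row; csv.writer just serialises it as-is.
writer.writerow([counts.get(word, 0) for word in fields])
```

`counts.get(word, 0)` fills missing words with 0, mirroring what `DictWriter` does with its `restval` default.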