Writing large CSV files - the dictionary-based CSV writer seems to be the problem

Posted on 2024-09-16 23:14:20

I have a big bag-of-words array (words and their counts) that I need to write to a large flat CSV file.

In testing with around 1000 words, this works just fine. I use DictWriter as follows:

self.csv_out = csv.DictWriter(open(self.loc+'.csv','w'), quoting=csv.QUOTE_ALL, fieldnames=fields)

where fields is the list of words (i.e. the keys of the dictionary that I pass to csv_out.writerow).

However, this seems to scale horribly: as the number of words increases, the time required to write a row grows far faster than linearly. The _dict_to_list method in csv seems to be the instigator of my troubles.

I'm not entirely sure how to even begin optimizing here. Are there any faster CSV routines I could use?

浪漫之都 2024-09-23 23:14:20

OK, this is by no means the answer, but I looked up the source code for the csv module and noticed a very expensive "not in" check in the module (lines 136-141 in Python 2.6).

if self.extrasaction == "raise":
    wrong_fields = [k for k in rowdict if k not in self.fieldnames]
    if wrong_fields:
        raise ValueError("dict contains fields not in fieldnames: " +
                         ", ".join(wrong_fields))
return [rowdict.get(key, self.restval) for key in self.fieldnames]

So a quick workaround seems to be to pass extrasaction="ignore" when creating the writer. This seems to speed things up very substantially.
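For anyone who wants to see the workaround in context, here is a minimal sketch. It assumes Python 3 (writeheader does not exist in Python 2.6), and the vocabulary, the sample row, and the in-memory buffer are all made up for illustration. Note that with extrasaction="ignore" any key that is not in fieldnames is silently dropped instead of raising an error.

```python
import csv
import io

# Hypothetical vocabulary and one bag-of-words row (illustrative names only).
fields = ["apple", "banana", "cherry"]
row = {"apple": 3, "cherry": 1}

buf = io.StringIO()  # stands in for the real output file
writer = csv.DictWriter(
    buf,
    fieldnames=fields,
    quoting=csv.QUOTE_ALL,
    restval=0,              # value written for words missing from a row
    extrasaction="ignore",  # skip the per-row check for unexpected keys
)
writer.writeheader()
writer.writerow(row)

output = buf.getvalue()
```

The reason this helps: fieldnames is a list, so in the code quoted above every "k not in self.fieldnames" test is a linear scan, and the check costs roughly len(rowdict) * len(fieldnames) comparisons per row.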

Not a perfect solution, and perhaps somewhat obvious, but I'm posting it in case it is helpful to somebody else.
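If you still want the safety check, one possible variation (my own sketch, not something the csv module provides) is to do the validation yourself against a set, where membership tests take constant time on average, and leave extrasaction="ignore" on the writer. The helper name write_rows_checked is hypothetical, and this again assumes Python 3.

```python
import csv
import io


def write_rows_checked(out, fields, rows):
    """Write dict rows to out, rejecting keys that are not in fields.

    Hypothetical helper: the key check uses a set, so its cost per row
    is proportional to the number of keys in that row.
    """
    allowed = set(fields)
    writer = csv.DictWriter(
        out, fieldnames=fields, restval=0, extrasaction="ignore"
    )
    writer.writeheader()
    for rowdict in rows:
        unexpected = [k for k in rowdict if k not in allowed]
        if unexpected:
            raise ValueError(
                "dict contains fields not in fieldnames: "
                + ", ".join(repr(k) for k in unexpected)
            )
        writer.writerow(rowdict)


buf = io.StringIO()  # stands in for the real output file
write_rows_checked(buf, ["a", "b"], [{"a": 1}, {"b": 2}])
checked_output = buf.getvalue()
```

Passing a row such as {"c": 5} would raise ValueError here, just as the default extrasaction="raise" would.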