Writing large CSV files - the dict-based CSV writer seems to be the problem

Posted 2024-09-16 23:14:20

I have a big bag-of-words array (words and their counts) that I need to write to a large flat CSV file.

In testing with around 1000 or so words, this works just fine - I use the DictWriter as follows:

self.csv_out = csv.DictWriter(open(self.loc+'.csv','w'), quoting=csv.QUOTE_ALL, fieldnames=fields)

where fields is a list of words (i.e. the keys of the dictionaries that I pass to csv_out.writerow).

However, this seems to scale horribly: as the number of words increases, the time required to write a row grows exponentially. The dict_to_list method in the csv module seems to be the instigator of my troubles.

I'm not entirely sure how to even begin optimizing here. Are there any faster CSV routines I could use?


Comments (2)

浪漫之都 2024-09-23 23:14:20

Ok, this is by no means the answer, but I looked up the source code for the csv module and noticed that there is a very expensive membership check in the module (lines 136-141 in Python 2.6):

if self.extrasaction == "raise":
    wrong_fields = [k for k in rowdict if k not in self.fieldnames]
    if wrong_fields:
        raise ValueError("dict contains fields not in fieldnames: " +
                         ", ".join(wrong_fields))
return [rowdict.get(key, self.restval) for key in self.fieldnames]

So a quick workaround seems to be to pass extrasaction="ignore" when creating the writer. This seems to speed things up very substantially.

Not a perfect solution, and perhaps somewhat obvious, but posting it in case it is helpful to somebody else.
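As a minimal sketch of that workaround (the field names and counts here are invented for illustration): because self.fieldnames is a plain list, the `k not in self.fieldnames` test is linear per key, so validating every key of every row is roughly quadratic in the number of columns. Passing extrasaction="ignore" skips that validation entirely:

```python
import csv

# Hypothetical vocabulary: one column per word
fields = ["apple", "banana", "cherry"]

with open("counts.csv", "w", newline="") as f:
    # extrasaction="ignore" bypasses the per-row check that every
    # dict key appears in fieldnames (a linear list scan per key)
    writer = csv.DictWriter(f, fieldnames=fields,
                            quoting=csv.QUOTE_ALL,
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerow({"apple": 3, "banana": 1, "cherry": 7})
```

Note that with "ignore", keys missing from fieldnames are silently dropped rather than raising ValueError, so this trades safety for speed.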

懒猫 2024-09-23 23:14:20


The obvious optimisation is to use a csv.writer instead of a DictWriter, passing in iterables for each row instead of dictionaries. Does that not help?

When you say "the number of words", do you mean the number of columns in the CSV? Because I've never seen a CSV that needs thousands of columns! Maybe you have transposed your data and are writing columns instead of rows? Each row should represent one datum, with its parts defined by the columns. If you really do need that sort of size, maybe a database is a better choice?
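A minimal sketch of the transposed layout suggested above - one (word, count) pair per row with a plain csv.writer, so no fieldname lookups happen at all (the word counts here are invented):

```python
import csv

# Hypothetical bag-of-words data
word_counts = {"apple": 3, "banana": 1, "cherry": 7}

with open("word_counts.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["word", "count"])  # header row
    # One row per word: writing time scales linearly with vocabulary size
    for word, count in word_counts.items():
        writer.writerow([word, count])
```

With this layout, adding more words adds rows rather than columns, which is both faster to write and far easier to process downstream.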
