How do you split reading a large csv file into evenly sized chunks in Python?
Basically, I have the following process:
import csv
reader = csv.reader(open('huge_file.csv', 'rb'))
for line in reader:
    process_line(line)
See this related question. I want to send lines to process_line in batches of 100 rows, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():
>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
How can I solve this?
Comments (3)
Just make your reader subscriptable by wrapping it into a list; then len() and slicing both work. Obviously this will break on really large files (see the alternatives in the updates below).

Further reading: How do you split a list into evenly sized chunks in Python?
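The answer's original code block appears to have been lost when this page was scraped; here is a minimal sketch of the list-wrapping idea, with a chunks helper in the style of the question linked above (the helper and loop details are assumptions, not the answer's exact code):

import csv

# Materializing the reader makes it subscriptable, at the cost of
# loading the entire file into memory.
reader = list(csv.reader(open('huge_file.csv', 'rb')))

def chunks(l, n):
    """Yield successive n-sized slices of list l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

for chunk in chunks(reader, 100):
    print len(chunk)  # placeholder: process one 100-row batch here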
Update 1 (list version): Another possible way is to process each chunk as it arrives, while iterating over the lines:
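This update's code is also missing; below is a sketch consistent with the description, accumulating rows into a list and flushing it every chunksize rows (process_chunk is a placeholder name):

import csv

reader = csv.reader(open('huge_file.csv', 'rb'))
chunk, chunksize = [], 100

def process_chunk(chunk):
    print len(chunk)  # placeholder: handle one batch of rows

for i, line in enumerate(reader):
    if i % chunksize == 0 and i > 0:
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

process_chunk(chunk)  # don't forget the trailing partial chunk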
Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:
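Again a reconstruction, since the original block did not survive extraction; one way to write such a chunk generator:

import csv

def gen_chunks(reader, chunksize=100):
    """Take a csv reader and yield chunksize-sized slices of rows."""
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []  # rebind instead of del chunk[:]; see the gotcha below
        chunk.append(line)
    if chunk:
        yield chunk

for chunk in gen_chunks(csv.reader(open('huge_file.csv', 'rb'))):
    print len(chunk)  # placeholder: process one batch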
There is a minor gotcha, as @totalhack points out:
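The quoted comment itself was not preserved in this copy, but the usual gotcha with this pattern is list reuse: if the generator clears its accumulator with del chunk[:] instead of rebinding it, every previously yielded chunk is the same list object and gets mutated in place, so a consumer that stores chunks for later ends up with multiple references to the final chunk. The sketch above avoids this by rebinding with chunk = [].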
We can use the pandas module to handle these big csv files; read_csv can consume the file in fixed-size chunks instead of all at once:
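The code for this answer was not preserved either; a minimal sketch using the chunksize parameter of pandas.read_csv, which turns the read into an iterator over DataFrames (the file name is taken from the question):

import pandas as pd

# Each iteration yields a DataFrame with up to 100 rows,
# so the whole file is never held in memory at once.
for chunk in pd.read_csv('huge_file.csv', chunksize=100):
    print len(chunk)  # placeholder: process one 100-row batch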
There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip a section of the file. Then you have to scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.
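The code block itself was lost during extraction; the following untested sketch follows the answer's description, splitting the file at its midpoint (the two-segment split and the segment names come from the prose; the helper details are assumptions):

import csv
import os

filename = 'huge_file.csv'
half = os.path.getsize(filename) // 2

first = open(filename, 'rb')
second = open(filename, 'rb')

# Jump the second handle to roughly the middle of the file, then
# skip ahead to the next newline so it starts on a row boundary.
# (This breaks if quoted fields contain embedded newlines, which is
# why this doesn't work for all .csv files.)
second.seek(half)
second.readline()
boundary = second.tell()  # byte offset where segment_two begins

segment_one = csv.reader(first)   # should stop once it reaches `boundary`
segment_two = csv.reader(second)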
I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.