How do you read a large CSV file in Python and split it into evenly sized chunks?

Posted 2024-10-16 22:46:32


Basically, I have the following process.

import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)

See this related question. I want to send batches of 100 rows at a time to process_line, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len().

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable

How can I solve this?


3 Answers

幻梦 2024-10-23 22:46:33


Just make your reader subscriptable by wrapping it in a list. Obviously this will break on really large files, since it loads the entire file into memory (see alternatives in the Updates below):

>>> reader = csv.reader(open('big.csv', newline=''))
>>> lines = list(reader)
>>> print(lines[:100])
...

Further reading: How do you split a list into evenly sized chunks in Python?
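
For reference, a minimal sketch of that slicing approach (assuming the whole file fits in memory, and using process_chunk as a placeholder for whatever you do with each batch):

chunksize = 100
chunks = [lines[i:i + chunksize] for i in range(0, len(lines), chunksize)]
for chunk in chunks:
    process_chunk(chunk)  # each chunk is a list of up to 100 rows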


Update 1 (list version): Another possible way is to process each chunk as it is filled while iterating over the lines:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', newline=''))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print(len(chunk))
    # do something useful ...

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)

Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', newline=''))

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print(chunk)  # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print(chunk)  # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

There is a minor gotcha, as @totalhack points out:

Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.
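
If you need to keep the chunks around after the loop (for example, to collect them in a list), one way around that gotcha is a variant of the generator that rebinds chunk to a fresh list instead of mutating it in place. This is only a sketch of that idea, not part of the original answer:

def gen_chunks(reader, chunksize=100):
    """Yield successive lists of up to `chunksize` rows, each a new list object."""
    chunk = []
    for line in reader:
        chunk.append(line)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []  # rebind rather than clear, so earlier chunks stay intact
    if chunk:
        yield chunk  # the remainder, possibly shorter than chunksize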

不再见 2024-10-23 22:46:33


We can use the pandas module to handle these big CSV files.

import pandas as pd

temp = pd.read_csv('BIG_File.csv', iterator=True, chunksize=1000)
df = pd.concat(temp, ignore_index=True)
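
Note that pd.concat rebuilds the whole file as a single DataFrame in memory, so this only helps with the reading, not the memory footprint. If the goal is to process batches of rows, a rough sketch (process_chunk is a placeholder name) would iterate over the chunks instead of concatenating them:

import pandas as pd

for chunk in pd.read_csv('BIG_File.csv', chunksize=100):
    process_chunk(chunk)  # each chunk is a DataFrame of up to 100 rows
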
风启觞 2024-10-23 22:46:33


There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip a section of the file. Then you have to scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.

import csv
import io

file_one = open('foo.csv', 'rb')
file_two = open('foo.csv', 'rb')
file_two.seek(0, 2)      # seek to the end of the file
sz = file_two.tell()     # fetch the total size in bytes
file_two.seek(sz // 2)   # seek back to (roughly) the middle
ch = b''
while ch != b'\n':
    ch = file_two.read(1)
# file_two is now positioned at the start of a record
segment_one = csv.reader(io.TextIOWrapper(file_one, newline=''))
segment_two = csv.reader(io.TextIOWrapper(file_two, newline=''))

I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.
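
One possible way to know when segment_one is done, not from the original answer but a sketch of the same byte-offset idea: record the offset where segment_two begins and stop feeding rows from file_one once that offset is reached. The lines_until helper below is a hypothetical name:

import csv
import io

def lines_until(binary_file, stop_offset, encoding='utf-8'):
    """Yield decoded lines from binary_file until its position reaches stop_offset."""
    while binary_file.tell() < stop_offset:
        line = binary_file.readline()
        if not line:
            break
        yield line.decode(encoding)

with open('foo.csv', 'rb') as file_one, open('foo.csv', 'rb') as file_two:
    file_two.seek(0, 2)                  # seek to the end of the file
    size = file_two.tell()
    file_two.seek(size // 2)             # jump to (roughly) the middle
    file_two.readline()                  # skip the partial record
    split_offset = file_two.tell()       # segment_two starts at this byte offset

    segment_one = csv.reader(lines_until(file_one, split_offset))
    segment_two = csv.reader(io.TextIOWrapper(file_two, newline=''))

    for row in segment_one:
        pass  # process the first half
    for row in segment_two:
        pass  # process the second half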
