What is the best way to process a very large (over 30GB) text file and show progress?

Posted 2024-11-09 20:05:15

[newbie question]

Hi,

I'm working on a huge text file which is well over 30GB.

I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop over it with "for", my computer crashes with a blue screen after processing about 10% of the data.

I'm currently using this:

f = open(file_path,'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()

Also, how can I show overall progress, i.e. how much of the data has been processed so far?

Thank you all very much.

Comments (3)

握住你手 2024-11-16 20:05:15

File handles are iterable, and you should probably use a context manager. Try this:

with open(file_path, 'r') as fh:
  for line in fh:
    process(line)

That might be enough.
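To connect this to the JSON step in the question, here is a minimal sketch of the same lazy loop producing one JSON string per line. `to_record` and `json_documents` are hypothetical names, and `to_record` is only a stand-in for whatever processing you actually do; the database write is left out because the question does not say which db is used.

```python
import json

def to_record(line):
    # Hypothetical stand-in for the real per-line processing.
    text = line.rstrip("\n")
    return {"text": text, "length": len(text)}

def json_documents(lines):
    """Lazily turn an iterable of lines into JSON strings, one at a time."""
    for line in lines:
        # Only the current line is held in memory, unlike f.readlines().
        yield json.dumps(to_record(line))
```

Passing the open file handle `fh` from the `with` block above to `json_documents` keeps memory use flat, because nothing is read until the loop asks for the next line.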

陌若浮生 2024-11-16 20:05:15

I use a function like this for a similar problem. You can wrap any iterable with it.

Change this

for one_line in f.readlines():

You just need to change your code to

# don't use readlines, it creates a big list of all data in memory rather than
# iterating one line at a time.
for one_line in progress_meter(f, 10000):

You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.

import time

def progress_meter(iterable, chunksize):
    """Print progress through iterable every chunksize items."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print(idx)  # items seen so far
            print('avg rate', idx / (time.time() - scan_start))
            print('inst rate', chunksize / (time.time() - since_last))
            since_last = time.time()
            print()
        yield val

時窥 2024-11-16 20:05:15

Using readline forces Python to find the end of each line in your file. If some lines are very long, the interpreter may run out of memory while buffering the whole line and crash.

In order to show progress you can check the file size for example using:

import os
f = open(file_path, 'r')
# os.fstat expects a file descriptor, not a file object
fsize = os.fstat(f.fileno()).st_size

The progress of your task is then the number of bytes processed divided by the file size, times 100 to get a percentage.
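A rough sketch of that calculation, assuming the file is UTF-8 text (the encoding is a guess, adjust it to your data). `lines_with_percent` is a hypothetical helper name. It opens the file in binary mode so that `len(raw)` counts bytes and matches the size reported by the OS.

```python
import os

def lines_with_percent(file_path):
    """Yield (line, percent_done) pairs while reading file_path lazily."""
    total = os.path.getsize(file_path)
    done = 0
    # Binary mode: each chunk's length is in bytes, like the file size.
    with open(file_path, "rb") as fh:
        for raw in fh:
            done += len(raw)
            percent = 100.0 * done / total
            # Assumption: UTF-8 input; undecodable bytes are replaced.
            yield raw.decode("utf-8", errors="replace"), percent
```

You would then loop with `for line, pct in lines_with_percent(file_path):` and print `pct` only every few thousand lines, so that printing does not slow the processing down.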
