What's the best way to process a very large (30GB+) text file and show progress?
[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop over it with "for", my computer crashes with a blue screen after about 10% of the data has been processed.
I'm currently using this:
f = open(file_path, 'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()
Also, how can I show the overall progress of how much data has been crunched so far?
Thank you all very much.
3 Answers
File handles are iterable, and you should probably use a context manager. Try this:
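The code block for this answer did not survive; presumably it was along these lines, reusing the question's `file_path` and `do_some_processing` (the temp file and helper here are stand-ins so the sketch is self-contained):

```python
import os
import tempfile

# Stand-ins for the question's file_path and do_some_processing.
fd, file_path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as tmp:
    tmp.write("line1\nline2\nline3\n")

processed = []
def do_some_processing(one_line):
    processed.append(one_line.strip())

# The suggested change: iterate the handle directly inside a context
# manager, so only one line is held in memory at a time and the file
# is closed automatically even if processing raises.
with open(file_path, 'r') as f:
    for one_line in f:
        do_some_processing(one_line)

os.remove(file_path)
print(processed)  # ['line1', 'line2', 'line3']
```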
That might be enough.
I use a function like this for a similar problem. You can wrap up any iterable with it.
Change this
You just need to change your code to
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
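The answer's code blocks were lost from the page; a sketch of the kind of wrapper it describes (the names `logged` and `every` are assumptions):

```python
def logged(iterable, every=100_000):
    # Wrap any iterable; print a status message every `every` items.
    for count, item in enumerate(iterable, 1):
        if count % every == 0:
            print(f"{count} items processed")
        yield item

# The question's loop would then become:
#   with open(file_path) as f:
#       for one_line in logged(f):
#           do_some_processing(one_line)

# Small self-contained demo:
print(sum(logged(range(10), every=5)))
```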
Using readline forces Python to find the end of each line in your file. If some lines are very long, this can crash your interpreter (there isn't enough memory to buffer the full line).
In order to show progress you can check the file size for example using:
The progress of your task can then be the number of bytes processed divided by the file size times 100 to have a percentage.
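The answer's snippet was lost; a sketch of the byte-counting approach it describes, using `os.path.getsize` for the total and opening in binary mode so `len(line)` is the exact number of bytes consumed (the temp file stands in for the question's 30GB file):

```python
import os
import tempfile

# Stand-in for the question's huge file: a small 2000-byte temp file.
fd, file_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as tmp:
    tmp.write(b"a\n" * 1000)

file_size = os.path.getsize(file_path)  # total size in bytes

# Binary mode: each line's len() is exactly the bytes read from disk.
bytes_done = 0
with open(file_path, 'rb') as f:
    for one_line in f:
        # do_some_processing(one_line) would go here
        bytes_done += len(one_line)
        percent = bytes_done / file_size * 100

os.remove(file_path)
print(f"{percent:.0f}% done")  # 100% done
```

(In text mode, calling `f.tell()` inside a `for line in f` loop raises an OSError in Python 3, which is why this sketch counts bytes itself.)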