The "for line in file object" method of reading a file
I'm trying to find out the best way to read/process lines of a super large file. Here I just try
for line in f:
Part of my script is as below:
import gzip

o = gzip.open(file2, 'w')            # file2: path of the output .gz file
LIST = []

f = gzip.open(file1, 'r')            # file1: path of the ~10 GB input .gz file
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        # average character value of every 4th line
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if ave1 < 84:
            del LIST[-4:]            # drop the whole 4-line group
output1 = o.writelines(LIST)
My file1 is around 10GB, and when I run the script the memory usage just keeps increasing to something like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really is no different from using readlines().

However, in the post Different ways to read large data in python, Srika told me: the for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management, so you don't have to worry about large files.

But obviously I still need to worry about large files... I'm really confused. Thanks.
edit:
Every 4 lines forms a kind of group in my data. The purpose is to do some calculation on every 4th line and, based on that calculation, decide whether we need to append those 4 lines. So writing lines is my purpose.
5 Answers
The reason the memory keeps increasing even after you use enumerate is because you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this: read, process, and move on to the next.

One more way you could do it is to read your file in chunks (in fact, reading 1 line at a time qualifies under this criterion, 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management, so you don't have to worry about large files.
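As a minimal sketch of that "read, process and move on" pattern (process_line is just a placeholder name, not something from the question or this answer; Python 2 style to match the question's code):

import gzip

def process_line(line):
    # placeholder: do whatever per-line work is needed, then forget the line
    pass

with gzip.open(file1, 'r') as f:     # file1 as in the question
    for line in f:                   # only one line is held in memory at a time
        process_line(line)           # nothing gets appended to a global list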
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try processing and writing each group as you go instead. I haven't tried this out, but it could maybe look something like this:
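A rough sketch of that idea, assuming the same gzip input/output files, 4-line groups and quality cutoff of 84 as in the question (Python 2 style):

import gzip

# write each 4-line group as soon as it passes the check,
# instead of accumulating every line in one big list
with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
    group = []
    for i, line in enumerate(f):
        group.append(line)
        if i % 4 == 3:                       # 4th line of the group
            ave1 = (sum(ord(x) for x in line) - 10) / float(len(line) - 1)
            if ave1 >= 84:                   # keep only passing groups
                o.writelines(group)
            group = []                       # forget the group either way

This way at most 4 lines are held in memory at any time.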
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST becomes longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.

Also, your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see, it seems you can easily write the output incrementally. I.e. you are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
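One way to sketch that, with a hypothetical groups_of_4 helper (my naming, not the answerer's), so that only the current 4-line group is ever referenced:

import gzip

def groups_of_4(f):
    # yield successive 4-line groups from an open file object
    group = []
    for line in f:
        group.append(line)
        if len(group) == 4:
            yield group
            group = []

with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
    for group in groups_of_4(f):
        last = group[-1]
        ave1 = (sum(ord(x) for x in last) - 10) / float(len(last) - 1)
        if ave1 >= 84:
            o.writelines(group)      # valid groups go straight to disk
        # otherwise the group simply goes out of scope and is freed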
If you do not use the with statement, you must close the file handles yourself:
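Presumably something along these lines, reusing the f and o handles from the question's script:

import gzip

f = gzip.open(file1, 'r')
o = gzip.open(file2, 'w')
try:
    pass    # ... read, filter and write the 4-line groups here ...
finally:
    f.close()    # without with, the handles must be closed explicitly
    o.close()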