Reading files with "for line in file object"

Published 2024-12-20 15:20:26


I'm trying to find out the best way to read/process the lines of a super-large file. Here I just try:

for line in f:

Part of my script is as below:

import gzip

o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if ave1 < 84:
            del LIST[-4:]
o.writelines(LIST)

My file1 is around 10 GB; when I run the script, the memory usage just keeps increasing, to around 15 GB, without any output. That means the computer is still trying to read the whole file into memory first, right? This is really no different from using readlines().

However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.

But obviously I still need to worry about large files... I'm really confused.
Thanks

edit:
Every 4 lines form a group in my data.
The purpose is to do some calculation on every 4th line and, based on that calculation, decide whether to append those 4 lines. So writing lines is my purpose.
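The 4-line-group logic described above can be streamed so that only one group is ever held in memory. This is a minimal sketch, not the original script: the function name `filter_groups`, the Python 3 text modes (`'rt'`/`'wt'`), and the use of `itertools.islice` are my own choices; the score formula is copied from the question.

```python
import gzip
from itertools import islice

def filter_groups(in_path, out_path, threshold=84):
    """Stream the input in 4-line groups, keeping only groups whose
    4th line scores at or above the threshold. At most one group
    (4 lines) is held in memory at a time."""
    with gzip.open(in_path, 'rt') as f, gzip.open(out_path, 'wt') as out:
        while True:
            group = list(islice(f, 4))   # next 4 lines (fewer at EOF)
            if len(group) < 4:
                break
            line = group[3]
            # Same score as the question: drop the trailing '\n' (ord 10)
            # from the sum and the length before averaging.
            ave = (sum(ord(c) for c in line) - 10) / float(len(line) - 1)
            if ave >= threshold:
                out.writelines(group)    # write kept groups immediately
```

Because each kept group is written as soon as it is scored, memory use stays flat regardless of how large file1 is.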


乖乖哒 2024-12-27 15:20:26


The reason the memory keeps increasing even after you use enumerate is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this: read, process, and move on to the next.

One more way you could do this is to read your file in chunks (in fact, reading 1 line at a time qualifies under this criterion: 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.

with open(...) as f:
    for line in f:
        <do something with line>

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
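The chunked reading this answer mentions can be sketched as a small generator (a generic helper, not from the original post; the name `read_in_chunks` and the default chunk size are my own choices):

```python
def read_in_chunks(f, chunk_size=64 * 1024):
    """Yield successive fixed-size chunks of a file object, so only
    one chunk is ever held in memory at a time."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:            # an empty read means EOF
            break
        yield chunk
```

For line-oriented data, `for line in f` already behaves like the 1-chunk == 1-line version of this.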

爱的那么颓废 2024-12-27 15:20:26


It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:

  1. Read the lines you need into memory (the first 3 lines).
  2. On the 4th line, append the line & perform your calculation.
  3. If your calculation is what you're looking for, flush the values in your collection to the file.
  4. Regardless of what follows, create a new collection instance.

I haven't tried this out, but it could maybe look something like this:

import gzip

o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)

        # If we've found what we want, save them to the file
        if ave1 >= 84:
            o.writelines(LIST)

        # Release the values in the list by starting a clean list to work with
        LIST = []

EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.

伊面 2024-12-27 15:20:26


Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST becomes longer and longer. All those lines that you store in LIST take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.

Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.

末骤雨初歇 2024-12-27 15:20:26


Ok, you know what your problem is already from the other comments/answers, but let me simply state it.

You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.

In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.

From what I see, it seems you can easily write the output incrementally. I.e., you are currently using a list to store both the valid lines to write to output and the temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.

In summary, use your list to store only the temporary data you need for your calculations, and once you have some valid data ready for output, simply write it to disk and delete it from main memory (in Python this means you should no longer hold any references to it).
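Applied to the question's data, "write valid output as soon as you know it" could look like the following sketch (the function name, paths, and Python 3 text modes are my assumptions; the score formula is the question's):

```python
import gzip

def filter_incrementally(in_path, out_path, threshold=84):
    """Buffer at most one 4-line group; flush it to the output the
    moment it is known to be valid, then drop it either way."""
    buf = []
    with gzip.open(in_path, 'rt') as f, gzip.open(out_path, 'wt') as out:
        for line in f:
            buf.append(line)
            if len(buf) == 4:
                last = buf[3]
                ave = (sum(ord(c) for c in last) - 10) / float(len(last) - 1)
                if ave >= threshold:
                    out.writelines(buf)   # valid: write immediately
                buf = []                  # release the group from memory
```

The list never holds more than 4 lines, so memory use is independent of the input size.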

攒一口袋星星 2024-12-27 15:20:26


If you do not use the with statement, you must close the file handles yourself:

o.close()

f.close()
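A minimal demonstration of the two styles (the temporary file path is just for illustration):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.gz')  # throwaway sample file

# Without `with`: close the handle explicitly, ideally in try/finally
# so it is closed even if an exception is raised.
o = gzip.open(path, 'wt')
try:
    o.write('hello\n')
finally:
    o.close()

# With `with`: the file is closed automatically when the block exits,
# even if an exception is raised inside it.
with gzip.open(path, 'rt') as f:
    data = f.read()
assert f.closed  # the handle is already closed here
```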