Reading files with "for line in file object"

Published 2024-12-20 15:20:26


I'm trying to find out the best way to read/process the lines of a super-large file. Here I just try:

for line in f:

Part of my script is as below:

import gzip

o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if ave1 < 84:
            del LIST[-4:]
o.writelines(LIST)

My file1 is around 10 GB; when I run the script, the memory usage just keeps increasing, to around 15 GB, without any output. That means the computer is still trying to read the whole file into memory first, right? This is really no different from using readlines().

However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.

But obviously I still need to worry about large files... I'm really confused.
Thanks

edit:
Every 4 lines form a group in my data.
The purpose is to do some calculation on every 4th line and, based on that calculation, decide whether to append those 4 lines. So writing lines is my purpose.
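The 4-line-group logic described above can be streamed so that only one group is ever held in memory. This is a minimal sketch, not the original script: the function name `filter_groups`, the Python 3 text modes (`'rt'`/`'wt'`), and the use of `itertools.islice` are my own choices; the score formula is copied from the question.

```python
import gzip
from itertools import islice

def filter_groups(in_path, out_path, threshold=84):
    """Stream the input in 4-line groups, keeping only groups whose
    4th line scores at or above the threshold. At most one group
    (4 lines) is held in memory at a time."""
    with gzip.open(in_path, 'rt') as f, gzip.open(out_path, 'wt') as out:
        while True:
            group = list(islice(f, 4))   # next 4 lines (fewer at EOF)
            if len(group) < 4:
                break
            line = group[3]
            # Same score as the question: drop the trailing '\n' (ord 10)
            # from the sum and the length before averaging.
            ave = (sum(ord(c) for c in line) - 10) / float(len(line) - 1)
            if ave >= threshold:
                out.writelines(group)    # write kept groups immediately
```

Because each kept group is written as soon as it is scored, memory use stays flat regardless of how large file1 is.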


乖乖哒 2024-12-27 15:20:26


The reason the memory keeps increasing even after you use enumerate is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this: read, process, and move on to the next.

One more way you could do this is to read your file in chunks (in fact, reading 1 line at a time qualifies under this criterion: 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.

with open(...) as f:
    for line in f:
        <do something with line>

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
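The chunked reading this answer mentions can be sketched as a small generator (a generic helper, not from the original post; the name `read_in_chunks` and the default chunk size are my own choices):

```python
def read_in_chunks(f, chunk_size=64 * 1024):
    """Yield successive fixed-size chunks of a file object, so only
    one chunk is ever held in memory at a time."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:            # an empty read means EOF
            break
        yield chunk
```

For line-oriented data, `for line in f` already behaves like the 1-chunk == 1-line version of this.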

爱的那么颓废 2024-12-27 15:20:26


It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:

  1. Read the lines you need into memory (the first 3 lines).
  2. On the 4th line, append the line & perform your calculation.
  3. If your calculation is what you're looking for, flush the values in your collection to the file.
  4. Regardless of what follows, create a new collection instance.

I haven't tried this out, but it could maybe look something like this:

import gzip

o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)

        # If we've found what we want, save them to the file
        if ave1 >= 84:
            o.writelines(LIST)

        # Release the values in the list by starting a clean list to work with
        LIST = []

EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.

伊面 2024-12-27 15:20:26


Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST becomes longer and longer. All those lines that you store in LIST take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.

Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.

末骤雨初歇 2024-12-27 15:20:26


Ok, you know what your problem is already from the other comments/answers, but let me simply state it.

You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.

In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.

From what I see, it seems you can easily write the output incrementally. I.e., you are currently using a list to store both the valid lines to write to output and the temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.

In summary, use your list to store only the temporary data you need for your calculations, and once you have some valid data ready for output, simply write it to disk and delete it from main memory (in Python this means you should no longer hold any references to it).
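Applied to the question's data, "write valid output as soon as you know it" could look like the following sketch (the function name, paths, and Python 3 text modes are my assumptions; the score formula is the question's):

```python
import gzip

def filter_incrementally(in_path, out_path, threshold=84):
    """Buffer at most one 4-line group; flush it to the output the
    moment it is known to be valid, then drop it either way."""
    buf = []
    with gzip.open(in_path, 'rt') as f, gzip.open(out_path, 'wt') as out:
        for line in f:
            buf.append(line)
            if len(buf) == 4:
                last = buf[3]
                ave = (sum(ord(c) for c in last) - 10) / float(len(last) - 1)
                if ave >= threshold:
                    out.writelines(buf)   # valid: write immediately
                buf = []                  # release the group from memory
```

The list never holds more than 4 lines, so memory use is independent of the input size.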

攒一口袋星星 2024-12-27 15:20:26


If you do not use the with statement, you must close the file handles yourself:

o.close()

f.close()
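A minimal demonstration of the two styles (the temporary file path is just for illustration):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.gz')  # throwaway sample file

# Without `with`: close the handle explicitly, ideally in try/finally
# so it is closed even if an exception is raised.
o = gzip.open(path, 'wt')
try:
    o.write('hello\n')
finally:
    o.close()

# With `with`: the file is closed automatically when the block exits,
# even if an exception is raised inside it.
with gzip.open(path, 'rt') as f:
    data = f.read()
assert f.closed  # the handle is already closed here
```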