Different ways to read big data in Python

Published 2024-12-20 05:42:14 · 1,079 characters · 2 views · 0 comments

I'm dealing with large data, so finding a good way to read it is really important.
I'm just a little confused about the different reading methods.

1. f = gzip.open(file, 'r')
   for line in f:
       process(line)
   # how can I process the nth line? can I?

2. f = gzip.open(file, 'r').readlines()
   # f is a list
   f[10000]
   # we can process the nth line

3. f = gzip.open(file, 'r')
   while True:
       linelist = list(islice(f, 4))

4. for line in fileinput.input():
       process(line)

What's the difference between 2 and 3? I just find their memory usage is the same. Does islice() also need to load the whole file into memory first (and only then take it bit by bit)?
And I hear the 4th method is the least memory-consuming; it really processes bit by bit, right?
For a 10GB-scale file, which file-reading method would you recommend? Any thoughts/information are welcome.
Thanks

Edit: I think one of my problems is that I sometimes need to pick out specific lines. Say:

f1 = open(inputfile1, 'r')
while True:
    line_group1 = list(islice(f1, 3))
    if not line_group1:
        break
    # then process specific lines, say, the second line
    processed = process(line_group1[1])
    if (....):
        LIST1.append(line_group1[0])
        LIST1.append(processed)
        LIST1.append(line_group1[2])

And then something like

with open(file, 'r') as f:
    for line in f:
        # process line

may not work, am I correct?

Comments (5)

瞎闹 2024-12-27 05:42:14

You forgot -

with open(...) as f:
    for line in f:
        <do something with line>

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.

Neither 2 nor 3 is advised for large files, as both read and load the entire file contents into memory before processing starts. To read large files you need to find ways to avoid reading the entire file in one single go.

There should be one -- and preferably only one -- obvious way to do it.
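A minimal sketch of the streaming pattern this answer describes, applied to a gzipped file as in the question (the path and the processing step are placeholders, not from the original post):

```python
import gzip

def process_lines(path):
    """Stream a gzipped text file line by line; only one line is
    decompressed and held in memory at a time during iteration."""
    results = []
    # 'rt' opens in text mode, so iteration yields decoded str lines
    with gzip.open(path, "rt") as f:
        for line in f:
            results.append(line.rstrip("\n"))  # stand-in for real work
    return results
```

For a 10GB file you would of course not accumulate every result in a list; the point is that the `for line in f` loop itself never holds more than one buffered chunk.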

情痴 2024-12-27 05:42:14

Check out David M. Beazley's talks on parsing large log files with generators (see the pdf for the presentation):

http://www.dabeaz.com/generators/
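The central idea in those talks is chaining generators so lines stream through a pipeline one at a time; a rough sketch in that spirit (the function names and log format here are illustrative, not taken verbatim from the talks):

```python
import gzip

def gen_lines(path):
    """Lazily yield lines from a plain or gzipped text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            yield line

def gen_grep(pattern, lines):
    """Lazily yield only the lines containing `pattern`."""
    return (line for line in lines if pattern in line)

def count_matches(path, pattern):
    """Pull lines through the chained generators one at a time;
    the whole file is never in memory at once."""
    return sum(1 for _ in gen_grep(pattern, gen_lines(path)))
```

Nothing is read until the final `sum` pulls on the chain, so the same pipeline works unchanged on a 10GB log.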

情场扛把子 2024-12-27 05:42:14

You can use enumerate to get an index as you iterate over something:

for idx, line in enumerate(f):
    # process line

Simple and memory efficient. You can actually use islice too, and iterate over it without converting to a list first:

for line in islice(f, start, stop):
    # process line

Neither approach will read the entire file into memory, nor create an intermediate list.

As for fileinput, it's just a helper class for quickly looping over standard input or a list of files, there is no memory-efficiency benefit to using it.

As Srikar points out, using the with statement is the preferred way to open/close a file.
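Applied to the grouped processing from the question's edit, islice can also be consumed in fixed-size chunks straight off the file iterator, without ever listing the whole file; a sketch (the upper-casing is a stand-in for whatever the real processing of the second line is):

```python
from itertools import islice

def process_groups(lines, size=3):
    """Consume a line iterator in fixed-size groups. Each islice
    call pulls only `size` lines, so the input streams lazily."""
    out = []
    while True:
        group = list(islice(lines, size))
        if not group:
            break
        if len(group) == size:
            group[1] = group[1].upper()  # stand-in: process the 2nd line
        out.extend(group)
    return out
```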

客…行舟 2024-12-27 05:42:14

You don't know how many lines a file has until you read it and count the \n characters in it.
In 1, you can add enumerate to get the line number.
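A small sketch of that enumerate suggestion: fetch the nth line by streaming, without knowing the line count in advance (the helper name is mine, not from the answer):

```python
def nth_line(lines, n):
    """Return line n (0-based) from an iterator of lines by streaming
    with enumerate; only one line is held in memory at a time.
    Returns None if the input has fewer than n + 1 lines."""
    for idx, line in enumerate(lines):
        if idx == n:
            return line
    return None
```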

九厘米的零° 2024-12-27 05:42:14

For reading specific lines in large files, you could use the linecache library.
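A minimal illustration of that approach (`pick_line` is a hypothetical wrapper; note the memory caveat in the docstring):

```python
import linecache

def pick_line(path, lineno):
    """Fetch one line by 1-based number via linecache; returns ''
    when the line number is out of range. Note: linecache caches
    the entire file in memory after first access, so it suits
    repeated random access better than a single pass over a
    10GB-scale file."""
    return linecache.getline(path, lineno)
```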
