Different ways of reading large data in Python
I'm dealing with large data, so finding a good way to read it is really important.
I'm just a little bit confused about the different reading methods.
1. Iterate over the file object directly:

       f = gzip.open(file, 'r')
       for line in f:
           process(line)
       # how can I process the nth line? can I?

2. Read all lines into a list with readlines():

       f = gzip.open(file, 'r').readlines()
       # f is a list
       f[10000]  # we can process the nth line

3. Read a few lines at a time with islice():

       f = gzip.open(file, 'r')
       while True:
           linelist = list(islice(f, 4))

4. Use the fileinput module:

       for line in fileinput.input():
           process(line)
What's the difference between 2 and 3? I find their memory usage is the same; it seems islice() also needs to load the whole file into memory first (and only later takes it bit by bit).
And I hear the 4th method is the least memory-consuming: it really processes bit by bit, right?
For a 10GB-scale file, which file-reading method would you recommend? Any thoughts/information are welcome.
Thanks.
edit: I think one of my problems is that I sometimes need to pick out specific lines.
Say:
    from itertools import islice

    f1 = open(inputfile1, 'r')
    while True:
        line_group1 = list(islice(f1, 3))
        if not line_group1:
            break
        # then process specific lines, say, the second line
        processed_2nd_line = process(line_group1[1])
        if (....):
            LIST1.append(line_group1[0])
            LIST1.append(processed_2nd_line)
            LIST1.append(line_group1[2])
And then something like

    with open(file, 'r') as f:
        for line in f:
            # process line

may not work, am I correct?
You forgot the with statement. It handles opening and closing the file, including when an exception is raised in the inner block. for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management, so you don't have to worry about large files.
Methods 2 and 3 are not advised for large files, as they read and load the entire file contents into memory before processing starts. To read large files you need to find ways that do not read the entire file in one go.
Check out David M. Beazley's talks on parsing large log files with generators (see the pdf for the presentation):
http://www.dabeaz.com/generators/
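In the spirit of those talks, here is a small sketch of a generator pipeline; the gen_lines/gen_grep names and the demo log data are invented for illustration:

```python
import gzip
import os
import tempfile

def gen_lines(path):
    # Lazily yield decoded lines from a gzip file, one at a time.
    with gzip.open(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n")

def gen_grep(pattern, lines):
    # Filter a line stream without materializing it as a list.
    return (line for line in lines if pattern in line)

# Demo data (made up): only matching lines are ever kept around.
path = os.path.join(tempfile.mkdtemp(), "demo.log.gz")
with gzip.open(path, "wt") as f:
    f.write("GET /a 200\nGET /b 404\nGET /c 200\n")
errors = list(gen_grep("404", gen_lines(path)))
os.remove(path)
```

Each stage pulls one line at a time from the previous stage, so arbitrarily long pipelines still process a 10GB file in constant memory.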
You can use enumerate to get an index as you iterate over something: simple and memory efficient. You can actually use islice too, and iterate over it without converting it to a list first.
Neither approach will read the entire file into memory, nor create an intermediate list.
As for fileinput, it's just a helper class for quickly looping over standard input or a list of files; there is no memory-efficiency benefit to using it.
As Srikar points out, using the with statement is the preferred way to open and close a file.
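A minimal sketch of the islice-without-list() idea, applied to the asker's pick-the-second-line-of-each-group case; the function name and demo data are assumptions, not from the answer:

```python
import os
import tempfile
from itertools import islice

def second_of_each_group(path, size=3):
    # Walk the file in `size`-line groups; only the lines we keep are
    # stored, so memory use stays constant regardless of file size.
    kept = []
    with open(path) as f:
        while True:
            group = islice(f, size)      # lazy view of the next `size` lines
            saw_any = False
            for i, line in enumerate(group):
                saw_any = True
                if i == 1:               # second line of the group
                    kept.append(line.rstrip("\n"))
            if not saw_any:              # islice yielded nothing: EOF
                break
    return kept

# Demo with six made-up lines (two groups of three).
fd, demo = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("a\nb\nc\nd\ne\nf\n")
picked = second_of_each_group(demo)
os.remove(demo)
```

Each islice() call simply advances the shared file iterator, so no group is ever turned into a list.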
You don't know how many lines the file has until you read it and count how many \n characters it contains.
In method 1, you can add enumerate to get the line number.
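A small sketch of that enumerate idea, assuming a hypothetical nth_line helper:

```python
import os
import tempfile

def nth_line(path, n):
    # Stream the file and return the 0-based nth line, or None if the
    # file is shorter; nothing beyond the current line is held in memory.
    with open(path) as f:
        for i, line in enumerate(f):
            if i == n:
                return line.rstrip("\n")
    return None

# Demo file with made-up contents.
fd, demo = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("zero\none\ntwo\n")
hit = nth_line(demo, 1)
miss = nth_line(demo, 99)
os.remove(demo)
```

The same pattern works unchanged with gzip.open(path, "rt") in place of open().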
For reading specific lines in large files, you could use the linecache module.
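A minimal linecache sketch (the demo file and its contents are made up). One caveat worth knowing: linecache reads the whole file into memory on first access, so it helps with fast repeated random access rather than with peak memory:

```python
import linecache
import os
import tempfile

# linecache addresses lines by 1-based number and returns "" for a
# line that does not exist.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

second = linecache.getline(path, 2)   # includes the trailing newline
missing = linecache.getline(path, 99)
linecache.clearcache()                # drop the cached file contents
os.remove(path)
```

For a 10GB file, the streaming approaches above are a better fit; linecache shines when you repeatedly jump around a file that fits in memory.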