Most memory-efficient way to save a binary file from the web using Python 2.6?

Posted on 2024-07-21 13:33:15


I'm trying to download (and save) a binary file from the web using Python 2.6 and urllib.

As I understand it, read(), readline() and readlines() are the three ways to read a file-like object.
Since binary files aren't really divided into lines, read() and readlines() read the whole file into memory.
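To make the distinction concrete, here is a minimal sketch of what a sized read looks like on a file-like object. The helper name iter_chunks and the 64 KiB chunk size are hypothetical choices for illustration, and an in-memory io.BytesIO stands in for the network response so the snippet runs without a connection.

```python
import io

def iter_chunks(fileobj, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes from fileobj."""
    while True:
        block = fileobj.read(chunk_size)
        if not block:  # read() returns an empty string at end of stream
            break
        yield block

# Stand-in for a network response: 150,000 bytes held in an in-memory stream.
source = io.BytesIO(b"\x00" * 150000)
sizes = [len(block) for block in iter_chunks(source, 64 * 1024)]
# sizes is [65536, 65536, 18928]: only one block is held at a time.
```

Only one block is referenced at any moment, so peak memory is bounded by the chunk size rather than by the size of the file.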

Is choosing an arbitrary read() buffer size the most efficient way to limit memory usage during this process?

i.e.

import urllib
import os

title = 'MyFile'
downloadurl = 'http://somedomain.com/myfile.avi'
# Build c:\mydirectory\myfile.avi from the last segment of the URL.
mydirpath = os.path.join('c:', os.sep, 'mydirectory',
                         downloadurl.split('/')[-1])

if not os.path.exists(mydirpath):
    print "Downloading...%s" % title
    # Open the connection only when the file actually needs downloading.
    webFile = urllib.urlopen(downloadurl)
    localFile = open(mydirpath, 'wb')
    data = webFile.read(1000000)  # 1 MB at a time
    while data:
        localFile.write(data)
        data = webFile.read(1000000)  # 1 MB at a time
    webFile.close()
    localFile.close()
    print "Finished downloading: %s" % title
else:
    print "%s already exists." % mydirpath

I chose read(1000000) arbitrarily because it worked and kept RAM usage down. I assume that if I were working with a raw network buffer, choosing an arbitrary amount would be bad, since the buffer might run dry if the transfer rate was too low. But it seems urllib is already handling the lower-level buffering for me.
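For comparison, the same read/write loop can probably be delegated to the standard library's shutil.copyfileobj, which copies between two file-like objects in blocks of a given length. The sketch below is only an illustration under assumptions: save_stream and CHUNK_SIZE are hypothetical names, the 1 MiB value is as arbitrary as the one above, and io.BytesIO objects stand in for the urllib response and the local file.

```python
import io
import shutil

CHUNK_SIZE = 1024 * 1024  # hypothetical choice: 1 MiB per read

def save_stream(source, destination, chunk_size=CHUNK_SIZE):
    """Copy source to destination in blocks of at most chunk_size bytes."""
    shutil.copyfileobj(source, destination, chunk_size)

# In-memory stand-ins for the urllib response and the local file.
response = io.BytesIO(b"abc" * 1000000)
local_file = io.BytesIO()
save_stream(response, local_file)
copied = len(local_file.getvalue())
```

With the objects from the script above, the call would be save_stream(webFile, localFile). Python 2's urllib also provides urlretrieve(url, filename), which writes the response straight to a file and may be worth comparing.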

With that in mind, is choosing an arbitrary number fine? Is there a better way?

Thanks.
