Python: saving a large web page to a file
Let me start off by saying I'm not new to programming, but I am very new to Python.
I've written a program using urllib2 that requests a web page, which I would then like to save to a file. The web page is about 300 KB, which doesn't strike me as particularly large, but it seems to be enough to give me trouble, so I'm calling it 'large'.
I'm using a simple call to copy directly from the object returned from urlopen
into the file:
file.write(webpage.read())
but it just sits for minutes, trying to write to the file, and eventually I receive the following:
Traceback (most recent call last):
  File "program.py", line 51, in <module>
    main()
  File "program.py", line 43, in main
    f.write(webpage.read())
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
httplib.IncompleteRead: IncompleteRead(6384 bytes read, 1808 more expected)
I don't know why this gives the program so much grief.
EDIT |
Here is how I'm retrieving the page:
jar = cookielib.CookieJar()
cookie_processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie_processor)
urllib2.install_opener(opener)

requ_login = urllib2.Request(LOGIN_PAGE,
                             data=urllib.urlencode({'destination': "", 'username': USERNAME, 'password': PASSWORD}))
requ_page = urllib2.Request(WEBPAGE)

try:
    # login
    urllib2.urlopen(requ_login)
    # get desired page
    portfolio = urllib2.urlopen(requ_page)
except urllib2.URLError as e:
    print e.code, ": ", e.reason
1 Answer
I'd use the handy file-object copier function provided by the shutil module. It worked on my machine :)
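A minimal sketch of how that might look with the code above (the output file name here is made up; portfolio is the response object returned by urlopen):

import shutil

# 'portfolio' is the file-like response returned by urllib2.urlopen(requ_page)
with open('portfolio.html', 'wb') as f:  # hypothetical output path
    shutil.copyfileobj(portfolio, f)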
UPDATE: You may want to pass a third argument to copyfileobj; it controls the size of the internal buffer used to transfer bytes.
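For example (the 64 KB figure below is only an illustration, not a value the answer prescribes):

# copy in 64 KB chunks instead of the default buffer size (16 KB in Python 2.7)
shutil.copyfileobj(portfolio, f, 64 * 1024)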
UPDATE2: There's nothing fancy about shutil.copyfileobj. It simply reads a chunk of bytes from the source file object and writes it to the destination file object, repeatedly, until there's nothing more to read. Here's the actual source code of it that I grabbed from inside the Python standard library:
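As it appears in Python 2.7's shutil.py (give or take the exact docstring):

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)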