Python: saving a large web page to a file
Let me start off by saying I'm not new to programming, but I am very new to Python.
I've written a program using urllib2 that requests a web page, which I would then like to save to a file. The web page is about 300 KB, which doesn't strike me as particularly large, but it seems to be enough to give me trouble, so I'm calling it 'large'.
I'm using a simple call to copy directly from the object returned from urlopen
into the file:
file.write(webpage.read())
but it just sits for minutes, trying to write to the file, and eventually I receive the following:
Traceback (most recent call last):
  File "program.py", line 51, in <module>
    main()
  File "program.py", line 43, in main
    f.write(webpage.read())
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
httplib.IncompleteRead: IncompleteRead(6384 bytes read, 1808 more expected)
I don't know why this gives the program so much grief.
EDIT |
Here is how I'm retrieving the page:
jar = cookielib.CookieJar()
cookie_processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie_processor)
urllib2.install_opener(opener)

requ_login = urllib2.Request(LOGIN_PAGE,
                             data=urllib.urlencode({'destination': "", 'username': USERNAME, 'password': PASSWORD}))
requ_page = urllib2.Request(WEBPAGE)

try:
    # login
    urllib2.urlopen(requ_login)
    # get desired page
    portfolio = urllib2.urlopen(requ_page)
except urllib2.URLError as e:
    print e.code, ": ", e.reason
1 Answer
I'd use the handy file-object copier function provided by the shutil module. It worked on my machine :)
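A minimal sketch of how that might look with the code above (the output file name here is made up; portfolio is the response object returned by urlopen):

import shutil

# 'portfolio' is the file-like response returned by urllib2.urlopen(requ_page)
with open('portfolio.html', 'wb') as f:  # hypothetical output path
    shutil.copyfileobj(portfolio, f)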
UPDATE: You may want to pass a third argument to copyfileobj; it controls the size of the internal buffer used to transfer bytes.
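For example (the 64 KB figure below is only an illustration, not a value the answer prescribes):

# copy in 64 KB chunks instead of the default buffer size (16 KB in Python 2.7)
shutil.copyfileobj(portfolio, f, 64 * 1024)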
UPDATE2: There's nothing fancy about shutil.copyfileobj. It simply reads a chunk of bytes from the source file object and writes it to the destination file object, repeatedly, until there's nothing more to read. Here's the actual source code of it that I grabbed from inside the Python standard library:
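As it appears in Python 2.7's shutil.py (give or take the exact docstring):

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)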