urllib2 gives inconsistent output - often downloads only part of a webpage


I am using urllib2 to open and save a webpage. However, often only a part of the page is downloaded, while at other times the full page comes through.

import urllib2
import time
import numpy as np
from itertools import izip

outFiles = ["outFile1.html", "outFile2.html", "outFile3.html", "outFile4.html"]

urls = ["http://www.guardian.co.uk/commentisfree/2011/sep/06/cameron-nhs-bill-parliament?commentpage=all",
        "http://www.guardian.co.uk/commentisfree/2011/sep/06/tory-scotland-conservative-murdo-fraser?commentpage=all",
        "http://www.guardian.co.uk/commentisfree/2011/sep/06/palestine-statehood-united-nations?commentpage=all",
        "http://www.guardian.co.uk/commentisfree/2011/sep/05/in-praise-of-understanding-riots?commentpage=all"]

# Install an opener that sends a browser-like User-Agent with every request.
opener = urllib2.build_opener()
user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
opener.addheaders = [('User-agent', user_agent)]
urllib2.install_opener(opener)

for fileName, url in izip(outFiles, urls):
    response = urllib2.urlopen(url)
    responseBody = response.read()
    # Write in binary mode so the raw bytes are saved unchanged.
    with open(fileName, 'wb') as fp:
        fp.write(responseBody)
    # Wait 20-40 seconds between requests to avoid hammering the server.
    time.sleep(np.random.randint(20, 40))

From run to run, the output files come out at different sizes. Sometimes a file is larger than 200 KB, up to 1 MB, while at other times it is only around 140 KB. What can lead to this difference?

When a file is smaller, the comment section is missing, yet the file is never truncated mid-page. Sometimes the whole page, comments included, is downloaded. I checked with curl and still had a similar problem. What I don't understand is what can lead to this inconsistency.
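One way to narrow this down (a minimal diagnostic sketch, not part of the script above; fetch_and_report is just an illustrative name) is to log, for each request, the final URL after any redirect, the HTTP status, the Content-Length the server claims, and the number of bytes actually read:

import urllib2

def fetch_and_report(url):
    # Hypothetical diagnostic helper: compare what the server claims
    # to send with what was actually received.
    response = urllib2.urlopen(url)
    body = response.read()
    # Content-Length may be None when the response is chunked.
    claimed = response.info().getheader('Content-Length')
    print 'final URL:', response.geturl()   # shows whether a redirect happened
    print 'status   :', response.getcode()
    print 'claimed  :', claimed
    print 'received :', len(body)
    return body

If the received byte count matches Content-Length even on the short runs, the server is genuinely sending a smaller page (for example, a variant without the comment block) rather than the connection being cut off, which would also explain why curl shows the same behaviour.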

