urllib2 gives inconsistent output - often downloads only part of a webpage


I am using urllib2 to open and save a webpage. However, often only a part of the page is downloaded, while at other times the full page comes through.

import urllib2
import time
import numpy as np
from itertools import izip

outFiles = ["outFile1.html", "outFile2.html", "outFile3.html", "outFile4.html"]

urls = ["http://www.guardian.co.uk/commentisfree/2011/sep/06/cameron-nhs-bill-parliament?commentpage=all",
        "http://www.guardian.co.uk/commentisfree/2011/sep/06/tory-scotland-conservative-murdo-fraser?commentpage=all",
        "http://www.guardian.co.uk/commentisfree/2011/sep/06/palestine-statehood-united-nations?commentpage=all",
        "http://www.guardian.co.uk/commentisfree/2011/sep/05/in-praise-of-understanding-riots?commentpage=all"]

# Install an opener that sends a browser-like User-Agent with every request.
opener = urllib2.build_opener()
user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
opener.addheaders = [('User-agent', user_agent)]
urllib2.install_opener(opener)

for fileName, url in izip(outFiles, urls):
    response = urllib2.urlopen(url)
    responseBody = response.read()
    # Write in binary mode so the raw bytes are saved unchanged.
    with open(fileName, 'wb') as fp:
        fp.write(responseBody)
    # Wait 20-40 seconds between requests to avoid hammering the server.
    time.sleep(np.random.randint(20, 40))

From run to run, the output files come out at different sizes. Sometimes a file is larger than 200 KB, up to 1 MB, while at other times it is only around 140 KB. What can lead to this difference?

When a file is smaller, the comment section is missing, yet the file is never truncated mid-page. Sometimes the whole page, comments included, is downloaded. I checked with curl and still had a similar problem. What I don't understand is what can lead to this inconsistency.
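One way to narrow this down (a minimal diagnostic sketch, not part of the script above; fetch_and_report is just an illustrative name) is to log, for each request, the final URL after any redirect, the HTTP status, the Content-Length the server claims, and the number of bytes actually read:

import urllib2

def fetch_and_report(url):
    # Hypothetical diagnostic helper: compare what the server claims
    # to send with what was actually received.
    response = urllib2.urlopen(url)
    body = response.read()
    # Content-Length may be None when the response is chunked.
    claimed = response.info().getheader('Content-Length')
    print 'final URL:', response.geturl()   # shows whether a redirect happened
    print 'status   :', response.getcode()
    print 'claimed  :', claimed
    print 'received :', len(body)
    return body

If the received byte count matches Content-Length even on the short runs, the server is genuinely sending a smaller page (for example, a variant without the comment block) rather than the connection being cut off, which would also explain why curl shows the same behaviour.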

