python - Faster downloading of ~500 webpages (loop)
For starters, I'm new to Python, so my code below may not be the cleanest. For a program I need to download about 500 webpages. The URLs are stored in an array which is populated by a previous function. The downloading part goes something like this:
def downloadpages(num):
    import urllib
    # Download each URL into webpages/<name>.htm
    for i in range(0, numPlanets):
        urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')
Each file is only around 20 KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size should only take a minute or two. Is there a way I can speed this up? Thanks
Edit: For anyone who is interested: following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download time continues to decrease as the number of threads is increased, up until around 60 threads, after which it begins to rise again.
Answers (4)
But you're not downloading a single file here. You're downloading 500 separate pages, and each connection involves overhead (the initial connection setup), plus whatever else the server is doing (is it serving other people?).
Either way, downloading 500 × 20 KB is not the same as downloading a single file of that size.
You can speed up execution significantly by using threads (be careful, though, not to overload the server).
Intro material / code samples:
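As a minimal sketch of the threaded approach (not the answer's original sample): it assumes the downloadlist and names arrays from the question, Python 2's Queue and urllib modules, and a fixed pool of 50 threads, roughly the size that worked for the asker.

import threading
import urllib
import Queue  # named 'queue' in Python 3

def worker(q):
    # Each thread pulls (url, filename) pairs off the queue until it is empty.
    while True:
        try:
            url, filename = q.get_nowait()
        except Queue.Empty:
            return
        urllib.urlretrieve(url, filename)

q = Queue.Queue()
for url, name in zip(downloadlist, names):   # arrays from the question
    q.put((url, 'webpages/' + name + '.htm'))

threads = [threading.Thread(target=worker, args=(q,)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()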
You can use greenlets to do this.
E.g., with the eventlet lib:
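A minimal sketch of that approach, assuming eventlet's GreenPool API and the downloadlist array from the question; the fetch helper is illustrative.

import eventlet
from eventlet.green import urllib2   # cooperative (non-blocking) urllib2

def fetch(url):
    # Illustrative helper: fetch one page and return its body.
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool(50)                # up to 50 concurrent greenlets
for body in pool.imap(fetch, downloadlist):
    pass                                     # save/process each page body here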
All calls in the pool will be pseudo-simultaneous.
Of course, you must first install eventlet with pip or easy_install.
There are several greenlet implementations in Python; you could do the same with gevent or another library.
In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP 1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over that, instead of having a TCP/IP setup/teardown for each request. Not sure if urllib2 does that by default; you might have to roll your own.
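As a rough illustration of connection reuse, here is a sketch using the standard httplib module (which speaks HTTP/1.1); the host and paths are placeholders, not taken from the question.

import httplib   # named 'http.client' in Python 3

# One HTTP/1.1 connection, reused for several requests to the same host.
conn = httplib.HTTPConnection('example.com')   # placeholder host
for path in ['/page1.htm', '/page2.htm']:      # placeholder paths
    conn.request('GET', path)
    response = conn.getresponse()
    body = response.read()                     # read fully before the next request
conn.close()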