python - Faster downloading of ~500 webpages (loop)

Posted on 2024-11-27 05:58:43


For starters, I'm new to Python, so my code below may not be the cleanest. For a program I need to download about 500 webpages. The URLs are stored in an array which is populated by a previous function. The downloading part goes something like this:

def downloadpages(num):
    import urllib
    for i in range(0, numPlanets):
        urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')

Each file is only around 20 KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size should only take a minute or two. Is there a way I can speed this up? Thanks

Edit: To anyone who is interested, following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download time continues to decrease as the threads are increased, up until around 60 threads, after which it begins to rise again.


Comments (4)

淡看悲欢离合 2024-12-04 05:58:43


But you're not downloading a single file here. You're downloading 500 separate pages, and each connection involves overhead (for the initial connection), plus whatever else the server is doing (is it serving other people?).

Either way, downloading 500 x 20 KB is not the same as downloading a single file of that size.

遗忘曾经 2024-12-04 05:58:43


You can speed up execution significantly by using threads (be careful, though, not to overload the server).

Intro material/Code samples:
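
As an illustration, here is a minimal sketch of the threaded approach (Python 2, to match the question's code). It assumes the question's downloadlist and names lists and uses 50 worker threads, roughly the count the questioner reported worked well:

import Queue
import threading
import urllib

NUM_THREADS = 50  # roughly the sweet spot reported in the question's edit

def worker(queue):
    # Each worker pulls (url, name) pairs until the queue is empty.
    while True:
        try:
            url, name = queue.get_nowait()
        except Queue.Empty:
            return
        urllib.urlretrieve(url, 'webpages/' + name + '.htm')

queue = Queue.Queue()
for url, name in zip(downloadlist, names):  # lists from the question
    queue.put((url, name))

threads = [threading.Thread(target=worker, args=(queue,))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Capping the number of threads is also what keeps you from overloading the server: only NUM_THREADS requests are ever in flight at once.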

无妨# 2024-12-04 05:58:43


You can use greenlets to do so.

E.g., with the eventlet library:

import eventlet
from eventlet.green import urllib2  # cooperative ("green") drop-in for urllib2

urls = [url1, url2, ...]

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()

# imap yields each body as its fetch completes; the fetches run concurrently
for body in pool.imap(fetch, urls):
    print "got body", len(body)

All calls in the pool will be pseudo-simultaneous.

Of course, you must install eventlet with pip or easy_install first.

There are several greenlet implementations in Python; you could do the same with gevent or another one, as sketched below.
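
For instance, a rough gevent equivalent of the eventlet example above might look like this (same hypothetical urls list; gevent's monkey-patching is what makes the standard urllib2 cooperative):

from gevent import monkey
monkey.patch_all()  # patch sockets so blocking I/O yields to other greenlets

import gevent.pool
import urllib2  # safe to use directly once monkey-patched

def fetch(url):
    return urllib2.urlopen(url).read()

pool = gevent.pool.Pool(50)  # cap concurrency so the server isn't hammered

for body in pool.imap(fetch, urls):
    print "got body", len(body)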

冷情 2024-12-04 05:58:43


In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP 1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over that, instead of having a TCP/IP setup/teardown for each request. Not sure if urllib2 does that by default; you might have to roll your own.
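
As a rough illustration of that idea (a simplified sketch, not tied to any particular library's keep-alive support), Python 2's httplib lets you keep one HTTP/1.1 connection open per host and issue several requests over it; query strings and error handling are glossed over here:

import httplib
import urlparse

def fetch_over_persistent_connections(urls):
    # Reuse one HTTP/1.1 connection per host instead of reconnecting per URL.
    bodies = []
    conn = None
    current_host = None
    for url in urls:
        parts = urlparse.urlsplit(url)
        if parts.netloc != current_host:
            if conn is not None:
                conn.close()
            conn = httplib.HTTPConnection(parts.netloc)
            current_host = parts.netloc
        conn.request('GET', parts.path or '/')
        response = conn.getresponse()
        # The full body must be read before the connection can be reused.
        bodies.append(response.read())
    if conn is not None:
        conn.close()
    return bodies

Combined with a thread pool (one persistent connection per thread), this avoids a fresh TCP setup/teardown for every one of the 500 pages.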
