How can I speed up fetching pages with urllib2 in Python?
I have a script that fetches several web pages and parses the info.
(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )
I ran cProfile on it, and as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.
Thanks in advance! :)
UPDATE: I have a function called fetchURLs(), which I use to make an array of the URLs I need, so something like urls = fetchURLS(). The URLs are all XML files from the Amazon and eBay APIs (which confuses me as to why they take so long to load; maybe my webhost is slow?).
What I need to do is load each URL, read each page, and send that data to another part of the script which will parse and display the data.
Note that I can't do the latter part until ALL of the pages have been fetched; that's what my issue is.
Also, I believe my host limits me to 25 processes at a time, so whatever is easiest on the server would be nice :)
Here is the profiler output:
Sun Aug 15 20:51:22 2010 prof
211352 function calls (209292 primitive calls) in 22.254 CPU seconds
Ordered by: internal time
List reduced from 404 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
10 18.056 1.806 18.056 1.806 {_socket.getaddrinfo}
4991 2.730 0.001 2.730 0.001 {method 'recv' of '_socket.socket' objects}
10 0.490 0.049 0.490 0.049 {method 'connect' of '_socket.socket' objects}
2415 0.079 0.000 0.079 0.000 {method 'translate' of 'unicode' objects}
12 0.061 0.005 0.745 0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
3428 0.060 0.000 0.202 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
1698 0.055 0.000 0.068 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
4125 0.053 0.000 0.056 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
1698 0.042 0.000 0.358 0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
1698 0.042 0.000 0.275 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)
Ray offers an elegant way to do this (in both Python 2 and Python 3). Ray is a library for writing parallel and distributed Python.
Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).
If you also want to process the webpages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose them together.
If you have a very long list of URLs that you want to fetch, you may wish to issue some tasks and then process them in the order that they complete. You can do this using ray.wait.
View the Ray documentation.
Fetching webpages obviously will take a while as you're not accessing anything local. If you have several to access, you could use the threading module to run a couple at once.
Here's a very crude example
This was the output when I ran it:
Grabbing the data from the thread by appending to a list is probably ill-advised (Queue would be better) but it illustrates that there is a difference.
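A minimal sketch of that kind of comparison, written for Python 3 (where urllib2.urlopen became urllib.request.urlopen). This is not the answer's original code: the names fetch, fetch_sequential, fetch_threaded and timed are made up for illustration, and the fetch parameter exists only so a different download function can be swapped in.

```python
import threading
import time
from urllib.request import urlopen  # Python 3 successor to urllib2.urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_sequential(urls, fetch=fetch):
    """Fetch the URLs one after another."""
    return [fetch(url) for url in urls]


def fetch_threaded(urls, fetch=fetch):
    """Fetch every URL in its own thread, appending bodies to a shared list."""
    pages = []  # append is safe under CPython's GIL, but the order is arbitrary

    def worker(url):
        pages.append(fetch(url))

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every download to finish
    return pages


def timed(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Calling timed(fetch_sequential, urls) and timed(fetch_threaded, urls) on the same list gives the two timings to compare. Because the threaded version appends in completion order, the pages do not line up with the input list.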
EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O. Therefore I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.
This is a dup of a question from 3 days ago.
Python urllib2.open is slow, need a better way to read several urls - Stack Overflow
Python urllib2.urlopen() is slow, need a better way to read several urls
I'm polishing the code to show how to fetch multiple webpages in parallel using threads.
As you can see, the application-specific code has only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.
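As a rough illustration of the approach described here (one thread per URL, results collected in a Queue, join() used to wait), here is a sketch for Python 3. It is not the answer's original listing; fetch and fetch_parallel are hypothetical names, and the fetch parameter is only there so the download function can be replaced.

```python
import queue
import threading
from urllib.request import urlopen  # Python 3 successor to urllib2.urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_parallel(urls, fetch=fetch):
    """Fetch all URLs concurrently; return {url: body or raised exception}."""
    results = queue.Queue()  # thread-safe hand-off from workers to caller

    def worker(url):
        try:
            results.put((url, fetch(url)))
        except Exception as exc:  # keep one bad URL from losing the rest
            results.put((url, exc))

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks until the thread ends; no polling loop needed

    # Every worker put exactly one item, so len(urls) gets cannot block.
    return dict(results.get() for _ in urls)
```

Returning a dict keyed by URL avoids depending on completion order; a URL that appears twice in the input collapses to one entry.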
Unfortunately, most of the other threading code posted here has some flaws. Much of it actively polls to wait for the code to finish. join() is a better way to synchronize the code. I think this code improves upon all the threading examples so far.
keep-alive connection
WoLpH's suggestion about using a keep-alive connection could be very useful if all your URLs point to the same server.
twisted
Aaron Gallagher is a fan of the twisted framework, and he is hostile to anyone who suggests threads. Unfortunately, a lot of his claims are misinformation. For example, he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to the evidence, as both Nick T and I have demonstrated a speed gain from using threads. In fact, I/O-bound applications have the most to gain from using Python's threads (vs. no gain for CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.
Right tool for the right job
I'm well aware of the issues pertaining to parallel programming using threads, Python, async I/O and so on. Each tool has its pros and cons. For each situation there is an appropriate tool. I'm not against twisted (though I have not deployed it myself). But I don't believe we can flat out say that threads are BAD and twisted is GOOD in all situations.
For example, if the OP's requirement were to fetch 10,000 websites in parallel, async I/O would be preferable. Threading would not be appropriate (unless maybe with Stackless Python).
Aaron's objections to threads are mostly generalizations. He fails to recognize that this is a trivial parallelization task. Each task is independent and does not share resources, so most of his attacks do not apply.
Given that my code has no external dependencies, I'll call it the right tool for the right job.
Performance
I think most people would agree that the performance of this task depends largely on the networking code and the external server, and that the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.
In Nick's code, there is an obvious flaw that caused the inefficiency. But how do you explain the 233ms speed gain over my code? I think even twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of twisted. There are, after all, a huge number of variables outside of the system code, like the remote server's performance, the network, caching, differences in implementation between urllib2 and the twisted web client, and so on.
Just to make sure Python's threading does not incur a huge amount of inefficiency, I did a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying that the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.
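A benchmark of this kind can be sketched in a few lines. This is an illustrative version for Python 3, not the script the answer used; spawn_overhead is a hypothetical name, and the threads do no work so that only start and join cost is measured.

```python
import threading
import time


def spawn_overhead(n):
    """Return the seconds taken to start and join n threads that do nothing."""
    threads = [threading.Thread(target=lambda: None) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

Comparing spawn_overhead(5) with spawn_overhead(500) on the machine in question shows how the cost grows with the thread count; the absolute numbers depend on the platform and the load.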
Further testing of my parallel fetching shows huge variability in the response time across 17 runs. (Unfortunately I don't have twisted installed to verify Aaron's code.)
My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test for measuring the systematic performance difference between async I/O and threading.
Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.
This code also performs better than any of the other solutions posted (edited after I closed some things that were using a lot of bandwidth):
And using Nick T's code, rigged up to also give the average of five and show the output better:
And using Wai Yip Tung's code:
I've gotta say, I do like that the sequential fetches performed better for me.
Here is an example using Python threads. The other threaded examples here launch a thread per URL, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many URLs on the same host).
Note: The times given here are for 40 URLs and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300ms.
With WORKERS=1 it took 86 seconds to run
With WORKERS=4 it took 23 seconds to run
With WORKERS=10 it took 10 seconds to run
so having 10 threads downloading is 8.6 times as fast as a single thread.
Here is an upgraded version that uses a Queue. There are at least a couple of advantages.
1. The URLs are requested in the order that they appear in the list
2. Can use q.join() to detect when the requests have all completed
3. The results are kept in the same order as the URL list
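The Queue-based version described above could look roughly like the following in Python 3. This is a sketch under assumptions, not the answer's original code: WORKERS follows the name used in the timings, while fetch, fetch_with_pool and the None sentinel used to stop the workers are illustrative choices.

```python
import queue
import threading
from urllib.request import urlopen  # Python 3 successor to urllib2.urlopen

WORKERS = 4  # upper bound on simultaneous requests


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_pool(urls, workers=WORKERS, fetch=fetch):
    """Fetch urls with a fixed number of threads; results match input order."""
    q = queue.Queue()
    results = [None] * len(urls)  # slot i holds the result for urls[i]

    def worker():
        while True:
            item = q.get()
            try:
                if item is None:  # sentinel: no more work for this worker
                    return
                index, url = item
                try:
                    results[index] = fetch(url)
                except Exception as exc:  # store the failure in its slot
                    results[index] = exc
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in enumerate(urls):  # queued in list order
        q.put(item)
    for _ in threads:  # one sentinel per worker
        q.put(None)
    q.join()  # returns once every queued item has been marked done
    return results
```

Capping the pool at a fixed size keeps the number of simultaneous hits on any one host bounded, which is the concern raised about thread-per-URL examples.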
Since this question was posted, it looks like there's a higher-level abstraction available, ThreadPoolExecutor: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
The example from there pasted here for convenience:
There's also map, which I think makes the code easier: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
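A short sketch of the map variant, assuming Python 3. It is not the example from the linked documentation; fetch and fetch_all are illustrative names, and max_workers=5 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, max_workers=5, fetch=fetch):
    """Fetch urls on a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Executor.map yields results in the order of `urls`, and re-raises
        # a worker's exception when its result is reached.
        return list(executor.map(fetch, urls))
```

Because map re-raises the first failure, code that needs to tolerate individual bad URLs would have to catch exceptions inside fetch or use submit with as_completed instead.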
The actual wait is probably not in urllib2 but in the server and/or your network connection to the server.
There are 2 ways of speeding this up.
multiprocessing lib to make things pretty easy.
Most of the answers focus on fetching multiple pages from different servers at the same time (threading), but not on reusing an already open HTTP connection, which matters if the OP is making multiple requests to the same server/site.
In urllib2, a separate connection is created for each request, which hurts performance and results in a slower rate of fetching pages. urllib3 solves this problem by using a connection pool. You can read more here: urllib3 [also thread-safe]
There is also Requests, an HTTP library that uses urllib3.
This, combined with threading, should increase the speed of fetching pages.
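To show what connection reuse means without any third-party package, here is a sketch using the standard library's http.client in Python 3. It illustrates the idea only and is not urllib3's API; fetch_paths is a hypothetical helper, and it assumes plain HTTP and a server that honours keep-alive.

```python
import http.client


def fetch_paths(host, paths, port=80, timeout=10):
    """Fetch several paths from one host over a single HTTP connection."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read in full before the next request
            # can be sent on the same connection.
            bodies.append(response.read())
    finally:
        conn.close()
    return bodies
```

With one connection, the DNS lookup and TCP handshake are paid once rather than once per page. If the server closes the connection after a response, http.client opens a new one on the next request, so the code still works but the saving is lost.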
Nowadays there is an excellent Python lib called requests that does this for you.
Use the standard requests API if you want a solution based on threads, or the async API (which uses gevent under the hood) if you want a solution based on non-blocking IO.
Here's a standard library solution. It's not quite as fast, but it uses less memory than the threaded solutions.
Also, if most of your requests are to the same host, then reusing the same http connection would probably help more than doing things in parallel.
Here is a Python network benchmark script for identifying the cause of single-connection slowness:
And an example of results with Python 3.6:
Python 2.7.13 has very similar results.
In this case, DNS and urlopen slowness are easily identified.
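A script along these lines can be sketched as follows for Python 3. It is not the answer's original script: time_dns, time_connect, time_urlopen and report are hypothetical names, and it assumes a plain HTTP server whose root path can be fetched.

```python
import socket
import time
from urllib.request import urlopen


def time_dns(host, port=80):
    """Seconds spent resolving host (the getaddrinfo call seen in the profile)."""
    start = time.perf_counter()
    socket.getaddrinfo(host, port)
    return time.perf_counter() - start


def time_connect(host, port=80, timeout=10):
    """Seconds to open a TCP connection; this includes its own DNS lookup."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.perf_counter() - start


def time_urlopen(url, timeout=10):
    """Seconds for a complete request: resolve, connect, send, read the body."""
    start = time.perf_counter()
    with urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def report(host, port=80):
    """Return the three timings for one host as a dict of seconds."""
    url = "http://%s:%d/" % (host, port)
    return {
        "dns": time_dns(host, port),
        "connect": time_connect(host, port),
        "urlopen": time_urlopen(url),
    }
```

If the "dns" figure dominates, as {_socket.getaddrinfo} does in the profile in the question, the delay is in name resolution rather than in the remote server or in Python.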