I recently tried to speed up a little tool (which uses urllib2 to send a request to the (unofficial) Twitter button-count URL (> 2000 URLs) and parses its results) with the multiprocessing module (and its worker pools). I read several discussions here about multithreading (which slowed the whole thing down compared to a standard, non-threaded version) and multiprocessing, but I couldn't find an answer to a (probably very simple) question:
Can you speed up URL calls with multiprocessing, or is the bottleneck something like the network adapter? I don't see which part of, for example, urllib2's urlopen method could be parallelized and how that should work...
EDIT: This is the request I want to speed up and the current multiprocessing setup:
urls=["www.foo.bar", "www.bar.foo",...]
tw_url='http://urls.api.twitter.com/1/urls/count.json?url=%s'
def getTweets(self,urls):
for i in urls:
try:
self.tw_que=urllib2.urlopen(tw_url %(i))
self.jsons=json.loads(self.tw_que.read())
self.tweets.append({'url':i,'date':today,'tweets':self.jsons['count']})
except ValueError:
print ....
continue
return self.tweets
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=4)
result = [pool.apply_async(getTweets(i,)) for i in urls]
[i.get() for i in result]
Take a look at gevent and specifically at this example: concurrent_download.py. It will be reasonably faster than multiprocessing and multithreading, and it can handle thousands of connections easily.
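A minimal sketch in the spirit of that example, assuming gevent is installed and urllib2 is made cooperative via monkey patching (the URLs are placeholders):

import gevent
from gevent import monkey
monkey.patch_all()  # make urllib2's blocking sockets cooperative

import urllib2

def fetch(url):
    # runs in a lightweight greenlet; thousands of these can wait concurrently
    return urllib2.urlopen(url).read()

urls = ['http://www.example.com/1', 'http://www.example.com/2']
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=10)
print [len(job.value) for job in jobs if job.value is not None]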
Ah, here comes yet another discussion about the GIL. Well, here's the thing: fetching content with urllib2 is going to be mostly IO-bound. Native threading and multiprocessing will both have the same performance when the task is IO-bound (threading only becomes a problem when it's CPU-bound). Yes, you can speed it up; I've done it myself using Python threads and something like 10 downloader threads.
Basically, you use a producer-consumer model with one thread (or process) producing URLs to download, and N threads (or processes) consuming from that queue and making requests to the server.
Here's some pseudo-code:
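(A sketch of that producer-consumer pattern using Python's threading and Queue modules; the worker count and fetch logic are illustrative, not the exact code this answer refers to.)

import threading
import urllib2
from Queue import Queue

NUM_WORKERS = 10
url_queue = Queue()
results = []
results_lock = threading.Lock()

def worker():
    # consumer: pull URLs off the queue and fetch them
    while True:
        url = url_queue.get()
        try:
            body = urllib2.urlopen(url).read()
            with results_lock:
                results.append((url, len(body)))
        finally:
            url_queue.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True          # don't keep the process alive for idle workers
    t.start()

for url in ["http://www.example.com/a", "http://www.example.com/b"]:
    url_queue.put(url)       # producer side: enqueue work

url_queue.join()             # block until every queued URL has been processed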
Now, if you're downloading very large chunks of data (hundreds of MB) and a single request completely saturates the bandwidth, then running multiple downloads is pointless. The reason you run multiple downloads (generally) is that requests are small and have a relatively high latency / overhead.
It depends! Are you contacting different servers, are the transferred files small or big, do you lose much of the time waiting for the server to reply or by transferring data, ...?
Generally, multiprocessing involves some overhead, and as such you want to be sure that the speedup gained by parallelizing the work is larger than the overhead itself.
Another point: network- and thus I/O-bound applications work – and scale – better with asynchronous I/O and an event-driven architecture instead of threading or multiprocessing, since in such applications much of the time is spent waiting on I/O and not doing any computation.
For your specific problem, I would try to implement a solution using Twisted, gevent, Tornado, or any other networking framework that does not use threads to parallelize connections.
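As an illustration of the event-driven approach, here is a hedged sketch with Twisted's older getPage helper (this assumes a Twisted release that still ships getPage; the URLs are placeholders):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

def report(results):
    # DeferredList hands back (success, value) pairs once every request finished
    for ok, value in results:
        print ok, (len(value) if ok else value)
    reactor.stop()

urls = ["http://www.example.com/a", "http://www.example.com/b"]
deferreds = [getPage(url) for url in urls]
defer.DeferredList(deferreds, consumeErrors=True).addCallback(report)
reactor.run()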
What you do when you split web requests over several processes is to parallelize the network latencies (i.e. the waiting for responses). So you should normally get a good speedup, since most of the processes should sleep most of the time, waiting for an event.
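For illustration, a minimal sketch of that idea with multiprocessing.Pool, where each worker spends most of its time asleep waiting for a response (the pool size and URLs are placeholders):

import urllib2
from multiprocessing import Pool

def fetch(url):
    # the worker process mostly waits here for the server to answer
    return len(urllib2.urlopen(url).read())

if __name__ == '__main__':
    urls = ['http://www.example.com/%d' % i for i in range(20)]
    pool = Pool(processes=4)
    print pool.map(fetch, urls)  # the four workers overlap their waiting time
    pool.close()
    pool.join()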
Or use Twisted. ;)
Nothing is useful if your code is broken: f() (with parentheses) calls a function in Python immediately; you should pass just f (no parentheses) to be executed in the pool instead. Notice the parentheses after getTweets in the code from the question, pool.apply_async(getTweets(i,)): that means all the code is executed in the main thread serially. Delegate the call to the pool instead by passing the function and its argument tuple separately: pool.apply_async(getTweets, (i,)).
Also, you don't need separate processes here unless json.loads() is expensive (CPU-wise) in your case. You could use threads: replace multiprocessing.Pool with multiprocessing.pool.ThreadPool -- the rest is identical. The GIL is released during IO in CPython, and therefore threads should speed up your code if most of the time is spent in urlopen().read().
Here's a complete code example.
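A minimal sketch of what such a complete version could look like, using multiprocessing.pool.ThreadPool as suggested above (the endpoint and result fields follow the question's code; the error handling and pool size are illustrative, not the answerer's original example):

import json
import urllib2
from multiprocessing.pool import ThreadPool  # threads suffice for IO-bound work

tw_url = 'http://urls.api.twitter.com/1/urls/count.json?url=%s'

def get_count(url):
    try:
        response = urllib2.urlopen(tw_url % url)
        return {'url': url, 'tweets': json.loads(response.read())['count']}
    except (urllib2.URLError, ValueError) as e:
        return {'url': url, 'error': str(e)}

if __name__ == '__main__':
    urls = ["www.foo.bar", "www.bar.foo"]   # placeholders from the question
    pool = ThreadPool(processes=20)         # one thread per concurrent request
    results = pool.map(get_count, urls)     # pass the function, don't call it
    pool.close()
    pool.join()
    print results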