I recently tried to speed up a little tool (which uses urllib2 to send a request to the (unofficial) Twitter button-count URL (> 2000 URLs) and parses its results) with the multiprocessing module (and its worker pools). I read several discussions here about multithreading (which slowed the whole thing down compared to a standard, non-threaded version) and multiprocessing, but I couldn't find an answer to a (probably very simple) question:
Can you speed up URL calls with multiprocessing, or is the bottleneck something like the network adapter? I don't see which part of, for example, urllib2's urlopen method could be parallelized and how that should work...
EDIT: This is the request I want to speed up and the current multiprocessing setup:
urls=["www.foo.bar", "www.bar.foo",...]
tw_url='http://urls.api.twitter.com/1/urls/count.json?url=%s'
def getTweets(self,urls):
for i in urls:
try:
self.tw_que=urllib2.urlopen(tw_url %(i))
self.jsons=json.loads(self.tw_que.read())
self.tweets.append({'url':i,'date':today,'tweets':self.jsons['count']})
except ValueError:
print ....
continue
return self.tweets
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=4)
result = [pool.apply_async(getTweets(i,)) for i in urls]
[i.get() for i in result]
Take a look at gevent and specifically at this example: concurrent_download.py. It will be reasonably faster than multiprocessing and multithreading, and it can handle thousands of connections easily.
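A minimal sketch in the spirit of that example, assuming gevent is installed and urllib2 is made cooperative via monkey patching (the URLs are placeholders):

import gevent
from gevent import monkey
monkey.patch_all()  # make urllib2's blocking sockets cooperative

import urllib2

def fetch(url):
    # runs in a lightweight greenlet; thousands of these can wait concurrently
    return urllib2.urlopen(url).read()

urls = ['http://www.example.com/1', 'http://www.example.com/2']
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=10)
print [len(job.value) for job in jobs if job.value is not None]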
Ah, here comes yet another discussion about the GIL. Well, here's the thing: fetching content with urllib2 is going to be mostly IO-bound. Native threading and multiprocessing will both have the same performance when the task is IO-bound (threading only becomes a problem when it's CPU-bound). Yes, you can speed it up; I've done it myself using Python threads and something like 10 downloader threads.
Basically, you use a producer-consumer model with one thread (or process) producing URLs to download, and N threads (or processes) consuming from that queue and making requests to the server.
Here's some pseudo-code:
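(A sketch of that producer-consumer pattern using Python's threading and Queue modules; the worker count and fetch logic are illustrative, not the exact code this answer refers to.)

import threading
import urllib2
from Queue import Queue

NUM_WORKERS = 10
url_queue = Queue()
results = []
results_lock = threading.Lock()

def worker():
    # consumer: pull URLs off the queue and fetch them
    while True:
        url = url_queue.get()
        try:
            body = urllib2.urlopen(url).read()
            with results_lock:
                results.append((url, len(body)))
        finally:
            url_queue.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True          # don't keep the process alive for idle workers
    t.start()

for url in ["http://www.example.com/a", "http://www.example.com/b"]:
    url_queue.put(url)       # producer side: enqueue work

url_queue.join()             # block until every queued URL has been processed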
Now, if you're downloading very large chunks of data (hundreds of MB) and a single request completely saturates the bandwidth, then running multiple downloads is pointless. The reason you run multiple downloads (generally) is that requests are small and have a relatively high latency / overhead.
It depends! Are you contacting different servers, are the transferred files small or big, do you lose much of the time waiting for the server to reply or by transferring data, ...?
Generally, multiprocessing involves some overhead, and as such you want to be sure that the speedup gained by parallelizing the work is larger than the overhead itself.
Another point: network- and thus I/O-bound applications work – and scale – better with asynchronous I/O and an event-driven architecture instead of threading or multiprocessing, since in such applications much of the time is spent waiting on I/O and not doing any computation.
For your specific problem, I would try to implement a solution using Twisted, gevent, Tornado, or any other networking framework that does not use threads to parallelize connections.
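As an illustration of the event-driven approach, here is a hedged sketch with Twisted's older getPage helper (this assumes a Twisted release that still ships getPage; the URLs are placeholders):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

def report(results):
    # DeferredList hands back (success, value) pairs once every request finished
    for ok, value in results:
        print ok, (len(value) if ok else value)
    reactor.stop()

urls = ["http://www.example.com/a", "http://www.example.com/b"]
deferreds = [getPage(url) for url in urls]
defer.DeferredList(deferreds, consumeErrors=True).addCallback(report)
reactor.run()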
What you do when you split web requests over several processes is to parallelize the network latencies (i.e. the waiting for responses). So you should normally get a good speedup, since most of the processes should sleep most of the time, waiting for an event.
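For illustration, a minimal sketch of that idea with multiprocessing.Pool, where each worker spends most of its time asleep waiting for a response (the pool size and URLs are placeholders):

import urllib2
from multiprocessing import Pool

def fetch(url):
    # the worker process mostly waits here for the server to answer
    return len(urllib2.urlopen(url).read())

if __name__ == '__main__':
    urls = ['http://www.example.com/%d' % i for i in range(20)]
    pool = Pool(processes=4)
    print pool.map(fetch, urls)  # the four workers overlap their waiting time
    pool.close()
    pool.join()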
Or use Twisted. ;)
Nothing is useful if your code is broken: f() (with parentheses) calls a function in Python immediately; you should pass just f (no parentheses) to be executed in the pool instead. Notice the parentheses after getTweets in the code from the question, pool.apply_async(getTweets(i,)): that means all the code is executed in the main thread serially. Delegate the call to the pool instead by passing the function and its argument tuple separately: pool.apply_async(getTweets, (i,)).
Also, you don't need separate processes here unless json.loads() is expensive (CPU-wise) in your case. You could use threads: replace multiprocessing.Pool with multiprocessing.pool.ThreadPool -- the rest is identical. The GIL is released during IO in CPython, and therefore threads should speed up your code if most of the time is spent in urlopen().read().
Here's a complete code example.
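A minimal sketch of what such a complete version could look like, using multiprocessing.pool.ThreadPool as suggested above (the endpoint and result fields follow the question's code; the error handling and pool size are illustrative, not the answerer's original example):

import json
import urllib2
from multiprocessing.pool import ThreadPool  # threads suffice for IO-bound work

tw_url = 'http://urls.api.twitter.com/1/urls/count.json?url=%s'

def get_count(url):
    try:
        response = urllib2.urlopen(tw_url % url)
        return {'url': url, 'tweets': json.loads(response.read())['count']}
    except (urllib2.URLError, ValueError) as e:
        return {'url': url, 'error': str(e)}

if __name__ == '__main__':
    urls = ["www.foo.bar", "www.bar.foo"]   # placeholders from the question
    pool = ThreadPool(processes=20)         # one thread per concurrent request
    results = pool.map(get_count, urls)     # pass the function, don't call it
    pool.close()
    pool.join()
    print results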