Threading HTTP requests (with proxies)

Posted on 2024-11-17 19:24:47

I've looked at similar questions, but there always seems to be a whole lot of disagreement over the best way to handle threading with HTTP.

What I specifically want to do: I'm using Python 2.7, and I want to try and thread HTTP requests (specifically, POSTing something), with a SOCKS5 proxy for each. The code I have already works, but is rather slow since it's waiting for each request (to the proxy server, then the web server) to finish before starting another. Each thread would most likely be making a different request with a different SOCKS proxy.

So far I've purely been using urllib2. I looked into modules like PycURL, but it is extremely difficult to install properly with Python 2.7 on Windows, which I want to support and which I am coding on. I'd be willing to use any other module though.

I've looked at these questions in particular:

Python urllib2.urlopen() is slow, need a better way to read several urls

Python - Example of urllib2 asynchronous / threaded request using HTTPS

Many of the examples there drew downvotes and arguments. Assuming the commenters are correct, writing the client with an asynchronous framework like Twisted sounds like it would be the fastest option. However, I Googled ferociously, and Twisted does not seem to provide any sort of support for SOCKS5 proxies. I'm currently using the Socksipy module, and I could try something like:

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, IP, port)
socks.wrapmodule(twisted.web.client)

I have no idea if that would work though, and I also don't even know if Twisted is what I really want to use. I could also just go with the threading module and work that into my current urllib2 code, but if that is going to be much slower than Twisted, I may not want to bother. Does anyone have any insight?

贪恋 2024-11-24 19:24:47

Perhaps an easier way would be to rely on gevent (or eventlet) to let you open lots of connections to the server. These libraries monkeypatch the standard library's blocking socket calls, which makes urllib2 cooperative (async) while still letting you write code that looks synchronous. Their overhead is smaller than that of threads, so you can spawn many more of them (thousands would not be unusual).

I've used something like this a lot (borrowed from here):

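import gevent
from gevent import monkey

# Patch the stdlib (including the socket and ssl modules) so that blocking
# calls cooperate with other greenlets. Do this before importing urllib2.
monkey.patch_all()

import urllib2

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']


def print_head(url):
    print('Starting %s' % url)
    data = urllib2.urlopen(url).read()
    print('%s: %s bytes: %r' % (url, len(data), data[:50]))

# Spawn one greenlet per URL, then wait for all of them to finish;
# without joinall the script could exit before the requests complete.
jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.joinall(jobs)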
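
For comparison, the question also mentions the plain threading route. A minimal sketch of that approach, using only the standard library, might look like the following. The names run_threaded and fake_fetch are hypothetical, and fake_fetch is only a stand-in: in real use it would be replaced by whatever function performs the POST through a given SOCKS5 proxy. Each job here is assumed to be a hashable (url, proxy) tuple.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7


def run_threaded(jobs, fetch, num_workers=4):
    """Run fetch(job) for every job on a small pool of worker threads.

    Returns a dict mapping each job to ('ok', result) or ('error', exception).
    Jobs must be hashable, e.g. (url, proxy) tuples.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do, let the thread finish
            try:
                outcome = ('ok', fetch(job))
            except Exception as exc:
                # Record the failure instead of killing the worker thread.
                outcome = ('error', exc)
            with lock:
                results[job] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(job):
    # Hypothetical stand-in for the real request code (no network access).
    url, proxy = job
    if proxy is None:
        raise ValueError('no proxy for %s' % url)
    return '%s via %s' % (url, proxy)
```

Calling run_threaded(jobs, fake_fetch, num_workers=2) with a list of (url, proxy) tuples returns one entry per job. Because the threads spend most of their time waiting on the network, they can overlap that waiting despite the GIL. How this compares in speed with gevent or Twisted is not something this sketch measures.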
