pycurl and a lot of callback functions

Posted on 2024-08-23 14:41:38

I have a large list of URLs that I need to download in parallel, checking one of the headers returned with each response.

I can use CurlMulti for parallelization.
I can use /dev/null as the output file, because I am not interested in the body, only in the headers.

But how can I check each header?

To receive headers, I must set the HEADERFUNCTION callback. I get that.

But in this callback function I get only a buffer with the headers. How can I distinguish one request from another?

I don't like the idea of creating as many callback functions as there are URLs. Should I create a class and as many instances of that class? That doesn't seem very clever either.


Comments (3)

左耳近心 2024-08-30 14:41:38

I would use Python's built-in httplib and threading modules. I don't see the need for a third-party module.
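
A rough sketch of what this could look like on Python 3, where httplib is named http.client. The helper names fetch_header and check_headers are made up for illustration, and starting one thread per URL is only reasonable for a short list; a large list would need a bounded pool of workers instead.

```python
import http.client
import threading
from urllib.parse import urlsplit

def fetch_header(url, header_name, results, lock):
    """Send a HEAD request and record one response header under its URL."""
    parts = urlsplit(url)
    if parts.scheme == 'https':
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    try:
        conn.request('HEAD', target)
        value = conn.getresponse().getheader(header_name)  # None if absent
    except (OSError, http.client.HTTPException) as exc:
        value = exc  # keep the failure so the caller can inspect it
    finally:
        conn.close()
    with lock:
        results[url] = value

def check_headers(urls, header_name):
    """Run one thread per URL (hypothetical helper) and collect the results."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=fetch_header, args=(url, header_name, results, lock))
        for url in urls
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each thread receives its own url argument, the result is keyed by URL and there is no ambiguity about which response a header belongs to.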

赠我空喜 2024-08-30 14:41:38

I know you're asking about pycurl, but I find it too hard to use and unpythonic. The API is weird.

Here's a Twisted example:

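from twisted.web.client import Agent
from twisted.internet import reactor, defer

def get_headers(response, url):
    '''Extract a dict of headers from the response'''
    # On Python 3, Twisted returns raw header names and values as bytes;
    # decode them so the result prints as plain strings.
    headers = {name.decode('latin-1'): [v.decode('latin-1') for v in values]
               for name, values in response.headers.getAllRawHeaders()}
    return url, headers

def got_everything(all_headers):
    '''Print results and end program'''
    print(dict(all_headers))
    reactor.stop()

agent = Agent(reactor)
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]
# Agent.request expects the method and URI as bytes on Python 3.
reqs = [agent.request(b'HEAD', url.encode('ascii')).addCallback(get_headers, url) for url in urls]
defer.gatherResults(reqs).addCallback(got_everything)
reactor.run()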

This example starts all requests asynchronously and gathers all the results. Here's the output for a file with 3 URLs:

{'http://debian.org': {'Content-Type': ['text/html; charset=iso-8859-1'],
                       'Date': ['Thu, 04 Mar 2010 13:27:25 GMT'],
                       'Location': ['http://www.debian.org/'],
                       'Server': ['Apache'],
                       'Vary': ['Accept-Encoding']},
 'http://google.com': {'Cache-Control': ['public, max-age=2592000'],
                       'Content-Type': ['text/html; charset=UTF-8'],
                       'Date': ['Thu, 04 Mar 2010 13:27:25 GMT'],
                       'Expires': ['Sat, 03 Apr 2010 13:27:25 GMT'],
                       'Location': ['http://www.google.com/'],
                       'Server': ['gws'],
                       'X-Xss-Protection': ['0']},
 'http://stackoverflow.com': {'Cache-Control': ['private'],
                              'Content-Type': ['text/html; charset=utf-8'],
                              'Date': ['Thu, 04 Mar 2010 13:27:24 GMT'],
                              'Expires': ['Thu, 04 Mar 2010 13:27:25 GMT'],
                              'Server': ['Microsoft-IIS/7.5']}}
不可一世的女人 2024-08-30 14:41:38

The solution is to use a little bit of functional programming to 'stick' some additional information onto the callback function.

functools.partial
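
A minimal sketch of how this might look with pycurl, assuming Python 3 (where pycurl passes each header line to the callback as bytes). The names on_header and make_handle, the results dict, and the example.com URLs are all made up for illustration. A single callback function is shared by every request; functools.partial binds the URL to it, so neither one function per URL nor a dedicated class is needed.

```python
from functools import partial

def on_header(url, results, line):
    """Store one raw header line under the URL it belongs to."""
    # The callback is invoked once per line, including the status line and
    # the blank line that ends the header block; those contain no colon.
    text = line.decode('iso-8859-1').strip()
    if ':' in text:
        name, value = text.split(':', 1)
        results.setdefault(url, {})[name.strip().lower()] = value.strip()

def make_handle(url, results):
    """Build a pycurl handle whose header callback already knows its URL."""
    import pycurl  # imported here so on_header can be tried without pycurl
    handle = pycurl.Curl()
    handle.setopt(pycurl.URL, url)
    handle.setopt(pycurl.NOBODY, True)  # headers only, skip the body
    handle.setopt(pycurl.HEADERFUNCTION, partial(on_header, url, results))
    return handle

# Simulated header lines for two hypothetical URLs, interleaved the way
# parallel transfers might deliver them.
results = {}
first = partial(on_header, 'http://example.com/a', results)
second = partial(on_header, 'http://example.com/b', results)
first(b'HTTP/1.1 200 OK\r\n')
first(b'Content-Type: text/html\r\n')
second(b'Content-Type: image/png\r\n')
first(b'\r\n')
```

Each handle returned by make_handle could then be added to a CurlMulti object as usual; the bound url argument tells the callback which request a given header line came from.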
