pycurl and a lot of callback functions
I have a big URL list which I have to download in parallel, checking one of the headers returned with each response.

I can use CurlMulti for parallelization. I can use /dev/null as the body file, because I am not interested in the body, only the headers.

But how can I check each header?

To receive the headers, I must set the HEADERFUNCTION callback. I get that. But in this callback function I only get the buffer with the headers. How can I distinguish one request from another?

I don't like the idea of creating as many callback functions as there are URLs. Should I create some class and as many instances of that class? That doesn't seem very clever either.
3 Answers
I would use Python's built-in httplib and threading modules. I don't see the need for a third-party module.
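A minimal sketch of that approach (httplib was renamed http.client in Python 3; the URL list and the choice of the Server header are placeholders):

    import http.client  # "httplib" in Python 2
    import threading
    from urllib.parse import urlparse

    urls = ["http://example.com/", "http://example.org/"]  # placeholder list

    def check(url, results):
        parsed = urlparse(url)
        conn = http.client.HTTPConnection(parsed.netloc)
        # HEAD is enough, since only the headers matter
        conn.request("HEAD", parsed.path or "/")
        response = conn.getresponse()
        results[url] = response.getheader("server")  # header of interest is assumed
        conn.close()

    results = {}
    threads = [threading.Thread(target=check, args=(url, results)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)

Each thread writes to its own key in the shared dict, so no extra locking is needed here.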
I know you're asking about pycurl, but I find it too hard and unpythonic to use. The API is weird.
Here's a twisted example:
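A minimal sketch of such an example, assuming twisted.web.client.Agent and an arbitrary check of the Server header (the URL list is a placeholder):

    from twisted.internet import reactor, defer
    from twisted.web.client import Agent

    urls = ["http://example.com/", "http://example.org/", "http://example.net/"]  # placeholder

    agent = Agent(reactor)

    def got_response(response, url):
        # Headers.getRawHeaders returns a list of values, or the default
        server = response.headers.getRawHeaders(b"server", [b"?"])[0]
        print(url, response.code, server)

    def fetch(url):
        d = agent.request(b"HEAD", url.encode("ascii"))
        d.addCallback(got_response, url)
        return d

    # Fire all requests at once, stop the reactor when every one has finished
    dl = defer.DeferredList([fetch(url) for url in urls], consumeErrors=True)
    dl.addCallback(lambda results: reactor.stop())
    reactor.run()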
This example starts all requests asynchronously and gathers all the results; for a file with 3 URLs it prints one line per URL with the status code and the header value.
The solution is to use a little bit of functional programming to 'stick' some additional information to our callback function: functools.partial.
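A minimal sketch of that idea with CurlMulti (the URL list and the header being checked are placeholders): functools.partial pre-binds the URL as the first argument, so one function serves as the HEADERFUNCTION for every handle:

    import functools
    import os
    import pycurl

    urls = ["http://example.com/", "http://example.org/"]  # placeholder list

    def header_callback(url, line):
        # pycurl passes only the buffer; the url was pre-bound with partial
        line = line.decode("iso-8859-1").rstrip()
        if line.lower().startswith("server:"):
            print(url, line)

    devnull = open(os.devnull, "wb")
    multi = pycurl.CurlMulti()
    handles = []
    for url in urls:
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, devnull.write)  # discard the body
        c.setopt(pycurl.HEADERFUNCTION, functools.partial(header_callback, url))
        multi.add_handle(c)
        handles.append(c)  # keep references alive

    # Standard CurlMulti driving loop
    num_handles = len(handles)
    while num_handles:
        while True:
            ret, num_handles = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)

Each Curl handle gets its own partial object, but they all wrap the same underlying function, which avoids both one callback per URL and a class with one instance per URL.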