Load balancing outgoing requests
I have a large threaded feed-retrieval script in Python.
My question is, how can I load balance outgoing requests so that I don't hit any one host too often?
This is a big problem for FeedBurner, since a large percentage of sites proxy their RSS through FeedBurner, and to further complicate matters, many sites will alias a subdomain on their domain to FeedBurner to obscure the fact that they're using it (e.g. "mysite" sets its RSS URL to feeds.mysite.com/mysite, where feeds.mysite.com bounces to FeedBurner). Sometimes it blocks me for a while and redirects to their "automated requests" error page.
3 Answers
You should probably do a one-time request (per week/month, whatever fits) for each feed and follow redirects to get the "true" address. Regardless of your throttling situation at the time, you should be able to resolve all feeds, save that data, and then just do it once for every new feed you add to the list. You can look at urllib's geturl(), as it returns the final URL from the one you put into it. When you do ping the feeds, be sure to use the original URL (keep the "real" one simply for load balancing) to make sure it redirects properly if the user has moved the feed or similar.
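A minimal sketch of that one-time resolution step, using only Python's standard library (the function names are made up; the caching schedule is up to you): urlopen follows HTTP redirects automatically, and geturl() on the response reports where the request actually ended up.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

def resolve_true_url(feed_url):
    """Follow all HTTP redirects and return the final ("true") URL,
    e.g. the FeedBurner address behind feeds.mysite.com."""
    with urlopen(feed_url) as resp:
        return resp.geturl()

def host_of(url):
    """Host to use as the load-balancing key for a resolved URL."""
    return urlparse(url).netloc.lower()
```

You would run resolve_true_url once per feed on your chosen schedule, store the result, and group feeds by host_of(resolved_url) when throttling.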
Once that is done, you can simply devise a load mechanism, such as allowing only X requests per hour for a given domain, going through each feed and skipping feeds whose hosts have hit the limit. If FeedBurner keeps its limits public (not likely) you can use that for X, but otherwise you will just have to make a rough estimate that you know to be below the limit. Knowing Google, however, their limiting might be based on request patterns rather than a specific hard cap.
Edit: Added suggestion from comment.
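That "X requests per hour per domain" mechanism could be sketched like this (the class name and the default of 60 per hour are assumptions, not FeedBurner's real limit):

```python
import time
from collections import defaultdict, deque

class DomainThrottle:
    """Allow at most max_per_hour requests per host (an assumed limit)."""

    def __init__(self, max_per_hour=60):
        self.max_per_hour = max_per_hour
        self.history = defaultdict(deque)  # host -> recent request timestamps

    def allow(self, host, now=None):
        """Return True and record the request if host is under its limit."""
        now = time.time() if now is None else now
        q = self.history[host]
        while q and now - q[0] > 3600:  # drop timestamps older than an hour
            q.popleft()
        if len(q) < self.max_per_hour:
            q.append(now)
            return True
        return False
```

In the retrieval loop you would call allow(host_of(feed_url)) and, when it returns False, skip the feed and retry it on a later pass.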
If your problem is related to Feedburner "throttling you", it most certainly does this because of the source IP of your bot. The way to "load balance to Feedburner" would be to have multiple different source IPs to start from.
Now there are numerous ways of achieving this, two of them being:
Of course, don't you go and put a NAT box in front of them now ;-)
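One low-level way to send from a particular source IP in Python is to bind the outgoing socket, rotating through the addresses you have. A sketch (the IPs below are placeholders; each must actually be configured on one of your machine's interfaces for the bind to succeed):

```python
import itertools
import socket

# Placeholder addresses -- substitute IPs actually bound to your interfaces.
SOURCE_IPS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
_ip_cycle = itertools.cycle(SOURCE_IPS)

def next_source_ip():
    """Round-robin over the available source IPs."""
    return next(_ip_cycle)

def connect_from(host, port, source_ip):
    """Open a TCP connection bound to a specific local IP (port 0 = any)."""
    return socket.create_connection((host, port),
                                    source_address=(source_ip, 0))
```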
The above takes care of the possible "throttling problems", now for the "scheduling part". You should maintain a "virtual scheduler" per "destination" and make sure not to exceed the parameters of the Web Service (e.g. Feedburner) in question. Now, the tricky part is to get hold of these "limits"... sometimes they are advertised and sometimes you need to figure them out experimentally.
I understand this is "high level architectural guidelines" but I am not ready to be coding this for you... I hope you forgive me ;-)
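Sticking to high-level guidance, the "virtual scheduler per destination" could still be sketched in a few lines (the minimum-interval parameter is an assumption; in practice you would tune it to each service's advertised or experimentally discovered limits):

```python
import time

class VirtualScheduler:
    """Track, per destination host, the earliest time the next request may go out."""

    def __init__(self):
        self.next_ok = {}  # host -> earliest allowed send time

    def schedule(self, host, min_interval, now=None):
        """Reserve a slot for one request to host; return its send time."""
        now = time.time() if now is None else now
        t = max(now, self.next_ok.get(host, now))
        self.next_ok[host] = t + min_interval
        return t
```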
"how can I load balance outgoing requests so that I don't hit any one host too often?"
Generally, you do this by designing a better algorithm.
For example, randomly scramble your requests.
Or shuffle them "fairly" so that you round-robin through the sources. That would be a simple list of queues where you dequeue one request from each host.
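That queue-of-queues idea could be sketched as a generator (the host names in the usage example are made up):

```python
from collections import deque

def round_robin(urls_by_host):
    """Yield one queued request per host per pass, so no host is hit
    twice before every other host with pending work is hit once."""
    queues = {host: deque(urls) for host, urls in urls_by_host.items()}
    while queues:
        for host in list(queues):
            yield queues[host].popleft()
            if not queues[host]:
                del queues[host]  # this host's queue is drained
```

For example, list(round_robin({"a.com": ["a1", "a2"], "b.com": ["b1"]})) interleaves the hosts rather than draining one before starting the next.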