用于检测陈旧数据的可扩展算法

发布于 2024-11-29 12:32:47 字数 214 浏览 2 评论 0原文

问题是:

安装在许多不同服务器上的“代理”每 5 秒向中央服务器发送一次“心跳”信号。如何主动找到心跳超过10秒的人并发出警报?

如果不考虑可扩展性,问题很简单。最简单的形式是,您可以在数据库表中记录从每个代理收到的最新心跳的时间戳,并运行常规查询以查找早于阈值的心跳。

然而,该解决方案无法扩展到数百万个代理。

我正在寻找使这成为可能的算法或技术。

Here is the problem:

"Agents" installed on many different servers send "heartbeat" signals up to a central server every 5 seconds. How can I find the ones that have missed their heartbeat for more than 10 seconds actively and raise an alert?

The problem is simple if you don't think about scalablity. In the simplest form, you can record the timestamp of the latest heartbeat received from each agent in a database table and run a regular query to find the ones older than the threshold.

This solution is however not scalable to millions of agents.

I am looking for algorithms or techologies that make this possible.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

幻梦 2024-12-06 12:32:47
  1. 使用地图:AgentId --> LastHearbeatTime
  2. 使用 11 组(假设 1 秒的分辨率就足够了),每组保存 1 秒窗口中报告的 Agent 的 ID。

每次代理报告心跳时:
1.在地图上找到它
2.从相关集合中删除
3.在地图中更新
4.将其添加到相关集合中

定义一个线程:每秒一次,最旧的集合过期。它应该是空的。如果没有 - 它包含未报告的代理的 ID。一旦集合过期,您可以重复使用它(集合的循环数组)。

我相信它可以在没有锁的情况下实现(也许你需要12套)。

  1. Use a map: AgentId --> LastHearbeatTime
  2. Use 11 sets (assuming a resolution of 1 second is enough), each holds Ids of Agents reported in a 1-second window.

Every time an agent report a hearthbeat:
1. Find it in the map
2. Delete it from the relevant set
3. Update it in the map
4. Add it to the relevant set

Define a thread: Once per second, the oldest set expires. It should be empty. If it doesn't - it contains Ids of agents which did not report. Once a set expires, you can reuse it (cyclic array of sets).

I believe it can be implemented without locks (maybe you'll need 12 sets).

最冷一天 2024-12-06 12:32:47

如果不了解语言和平台,就很难为您提供详细的实施建议,但我的建议与 Lior Kogan 的建议有些相似。
然而,在我看来,你只需要两个集合,并且不涉及映射:

假设你有两个代表集合的变量,A和B。

每次心跳都会从集合A中删除代理ID。
每 5 秒,一个不同的线程会针对 B 中的每个代理 id 发出警报,然后设置 B = A,最后但并非最不重要的一点是创建一个包含所有代理 id 的集合,并将 A 设置为等于该值(如果代理 id 的数量真的很大,你可以在一张支票和另一张支票之间准备新的一套,然后只在剩下的时间里睡觉)。
如果您使用无锁集合集合,则仅在更改指向每个集合的变量时才需要锁定。
性能在很大程度上取决于所述实现的算法复杂性,如果你按照这种方式进行,你应该优先考虑性能最好的算法(不一定是最好的大O,例如,如果最坏情况下的延迟对你很重要)。

附带说明一下,如果内存不是问题或者故障相对较少,那么当您检查是否需要发出警报并执行此操作时,您可以在其自己的线程上执行此操作,并获得可能有趣的性能加速(同样,平台和运行时很重要,因为在 erlang 中这很容易,但在 Windows 中,创建一个成熟的新线程的成本可能会超过性能收益(如果故障很少),而代价是将旧的 B 集保留在内存中。

Without knowing language and platform it's somewhat hard to advise you on a detailed implementation, however my advice is somewhat similar to Lior Kogan's.
In my opinion, however, you only need two sets and no map is involved:

Say you have two variables representing sets, A and B.

Every heartbeat removes the agent id from set A.
Every 5 seconds, a different thread raises an alert for every agent id in B, then sets B = A, and last but not least creates a set with all of the agent ids and sets A to equal that (if the number of agent ids is really large, you can prepare the new set between one check and the other and only sleep for the remaining time).
Locking would only be needed while changing the variables pointing to each set, provided you use a lock-free set collection.
Performance will largely depend on the algorithmic complexity of said implementation, and if you go down this way, you should privilege the one with best performance (not necessarily best big-O, for instance if wost-case latency matters to you) for removals.

As a side note, if memory is not an issue or failures are relatively few, when you check whether you need to raise alerts and do so, you can do that on a thread of its own and getting possibly interesting performance speedups (again, the platform and runtime matter, for in erlang that would be a breeze but in Windows the cost of creating a full-blown new thread might exceed the performance benefit if the failures are few) at the cost of keeping the old B set in memory.

等风来 2024-12-06 12:32:47

MongoDB 非常适合这种类型的使用。虽然不完全是一种算法,但它确实符合创建此服务所需的基础技术的要求。我们在 CopperEgg 中将其用于我们的 RevealCloud 产品,以完全按照您所说的操作 - 我们发送当系统离开一段时间时发出警报 - 每 5 秒采样一次。我很想了解更多有关您的想法和用例的信息。你能提供更多细节吗?

MongoDB is great for this type of use. While not exactly an algorithm, it does fit the bill for a fundamental technology that is needed to create this service. We use it here at CopperEgg for our RevealCloud product to do exactly what you say - we send an alert when the system has gone away for some bit of time - sampling every 5 seconds. I'd love to hear more about your thoughts and use case. Can you provide more details?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文