Lookup structure for handling future events (time-based)


I am looking for an efficient data structure that'll allow me to queue events ... that is, I will have an app where, at any time in execution, it is possible that an event will be raised for a future point in execution ... something like:

  • t=20: in 420 seconds, A occurs
  • t=25: in 13 seconds, B occurs
  • t=27: in 735 seconds, C occurs
  • ...

so I'd like to have a data structure where I can put in any event at any time in the future, and where I can get and (by doing so) remove all due events ... also, a plus would be if I were able to remove an event from the data structure (because it was cancelled) ... not too important though, since I can simply flag it as cancelled ...

my first thought was maybe to do some sort of tree, but I guess the removing-due-events part would require a lot of rebalancing ...

I am considering simply having an int hash, mapping timestamps to either null or stacks of events that are to occur at that point in time ... I think in scenarios with a lot of events (possibly multiple every second - which is what I intend to work with), this actually isn't such a bad idea after all ...

so I am eager to hear your input ... :)


edit:

  • to be more specific: I think n here is at about 100K-1M, and I guess I might have about 1-100 events/second ...
  • the t is of no special importance ... it is only there to illustrate that a future event can be "enqueued" at any time ...

thanks

back2dos


4 Answers

一梦浮鱼 2024-08-14 06:07:52


I believe you're looking for a priority queue, with the timestamp of when the event occurs being the priority (well, lower timestamps would be higher priority).

Just a little elucidation on your use cases:

  ... where i can put in any event in any time in the future ...

You'd insert into the priority queue with insertWithPriority, using the timestamp of when the event occurs. This would be O(lg N).

  ... and where i can get and (by doing so) remove all due events ...

You'd repeatedly call getTop (which gets the event with the lowest timestamp), collecting all elements up to the time of interest.

  ... also, a plus would be, if i were able to remove an event from the datastructure (because it was canceled) ... not too important though, since i can simply flag it as cancelled ...

This would be possible, but would be O(lg N) due to rebalancing.
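
Here's a minimal sketch of that approach in Java (my choice of language; any binary-heap priority queue behaves the same). `ScheduledEvent` is a hypothetical wrapper, and `popDue` plays the role of the repeated getTop calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical event wrapper: a due time (epoch millis) plus a payload.
record ScheduledEvent(long dueAtMillis, Runnable action) {}

class EventQueue {
    // Min-heap ordered by due time: lower timestamp = higher priority.
    private final PriorityQueue<ScheduledEvent> queue =
        new PriorityQueue<>((a, b) -> Long.compare(a.dueAtMillis(), b.dueAtMillis()));

    // insertWithPriority: O(lg N)
    void schedule(ScheduledEvent e) {
        queue.add(e);
    }

    // Repeated getTop: drain everything due at or before `now`; each poll is O(lg N).
    List<ScheduledEvent> popDue(long now) {
        List<ScheduledEvent> due = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().dueAtMillis() <= now) {
            due.add(queue.poll());
        }
        return due;
    }
}
```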

枕梦 2024-08-14 06:07:52


How big is N? How often do you have to insert and remove items, compared to everything else going on? If that is more than 10% of total execution time, and if N is typically more than 100 (say), then maybe it's time to be concerned about big-O. I've seen programs with priority queues implemented with fancy container algorithms, allocating iterators, hash maps, heaps, etc., spending all their time creating and releasing abstract objects, where the median queue length was like three.

ADDED: OK, since N ~ 10^6 and the frequency is ~100 Hz, you probably want some sort of binary tree or heap with O(log N) insertion/removal time. If you're willing to devote 1% of CPU time to this, that is 10^6 microseconds * 1% / 100 = 10^2 microseconds per operation. That should not be difficult at all, because if the typical search depth is 20, at ~50 ns per comparison, a search takes ~1 microsecond. Just make sure to keep it simple, without getting all wrapped up in abstract datatypes. You shouldn't have to worry much about the time spent allocating/freeing tree nodes, because you only allocate/free one node per operation. Rebalancing need not be done frequently - maybe only after every 1000 operations. If you can collect your insertions in batches and then insert them in random order, that may prevent the tree from getting too unbalanced. If many of your events are simultaneous, you could add a small amount of noise to the time code, to prevent parts of the tree from becoming more like a linear list.
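
To make the last two suggestions concrete, here's a rough sketch (again in Java, with keys assumed to be raw timestamps): the random-noise trick from above, plus a deterministic sequence-number tiebreaker that has the same effect:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

class TimeKeys {
    private static final AtomicLong seq = new AtomicLong();

    // The "small amount of noise" idea: perturb the key by a few microseconds,
    // assuming events tolerate sub-millisecond reordering.
    static long withJitter(long dueAtMicros) {
        return dueAtMicros + ThreadLocalRandom.current().nextLong(0, 8);
    }

    // Deterministic alternative: pack an insertion counter into the low bits,
    // so simultaneous events still get distinct, ordered keys
    // (assumes the millisecond timestamp fits in the remaining 43 high bits).
    static long withTiebreaker(long dueAtMillis) {
        return (dueAtMillis << 20) | (seq.getAndIncrement() & 0xFFFFF);
    }
}
```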

北陌 2024-08-14 06:07:52


Ok, I'd like to thank you all for your answers - very interesting and helpful. :)

PriorityQueue is definitely the right term I was searching for - thanks for that.
Now it's all about implementation.

Here is what I think:

Let N be the size of the queue and M be the average number of events per timestamp ("concurrent" events, so to speak) at the time of processing. (The density of events will not be evenly distributed: the "far future" is much more sparse, but as time moves on, a given region of time becomes much more dense; actually, I think the maximum density will be somewhere 4 to 12 hours into the future.) I am looking for a scalable solution that performs well for a considerably big M. The goal is to really process those M due events within one second, so I want to spend as little time as possible on finding them.

  1. Going for the simple tree approach, as suggested several times, I'll be having O(log N) insertion, which is quite good, I guess. The cost of processing one timestamp would be O(M log N), if I am right, which is not so good anymore.
  2. An alternative would be to have a tree with lists of events instead of single events. It should be feasible to implement some getListForGivenStampAndCreateIfNoneExists operation that'd be a little faster than going down the tree twice when no list exists. But anyway, as M grows, this shouldn't even matter too much. Thus insertion would be O(log N), as before, and processing would be O(M + log N), which is also good, I think. (See the sketch right after this list.)
  3. The hash-of-lists-of-events approach I formulated. This should also have O(1) insertion and O(M) processing cost, although this is not too trivial with hashes. Sounds cool, actually. Or am I missing something? Of course it is not so easy to make a hash perform well, but apart from that, are there any problems? Or is the hash itself the problem? Wikipedia states:
     "In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized) cost per operation."
     A quick benchmark showed that the standard implementation for my platform seems to match this.
  4. The array-of-lists-of-events approach provided by DVK. This has O(1) insertion. Now that is good. But if I understand correctly, it has O(M+T) processing cost, with T being the size of the array (the number of time slots, if you will), because removal from arrays comes at linear cost. Also, this only works if there is a maximum time offset.
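
For variant 2, here's a minimal sketch assuming Java's TreeMap as the balanced tree: `computeIfAbsent` plays the role of getListForGivenStampAndCreateIfNoneExists (one walk down the tree), and a `headMap` view hands back all due buckets in one go:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class TreeOfLists<E> {
    // Red-black tree keyed by timestamp; each node holds all events due at that time.
    private final TreeMap<Long, List<E>> buckets = new TreeMap<>();

    // Find-or-create the bucket and append, in one descent: O(log N).
    void schedule(long dueAt, E event) {
        buckets.computeIfAbsent(dueAt, t -> new ArrayList<>()).add(event);
    }

    // O(log N) to locate the cutoff, then O(M) to drain the M due events.
    List<E> popDue(long now) {
        SortedMap<Long, List<E>> head = buckets.headMap(now, true);
        List<E> due = new ArrayList<>();
        for (List<E> bucket : head.values()) {
            due.addAll(bucket);
        }
        head.clear(); // the view writes through, removing the drained buckets
        return due;
    }
}
```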

Actually, I would like to discuss the array approach. O(M+T) is not good. Not at all. But I put some brains into it, and this is what I came up with:

First Idea: Laziness

The O(T) could be crunched down by an arbitrary factor by introducing a bit of laziness, but in the end it would stay O(T). But how bad is that? Let's have T = 2419200, which is 28 days. Then, once a day, I'd clean it up (preferably while low load is expected). That would waste less than 5% of the array. On my target platform, the copy operation takes 31 ms on a fairly old 2 GHz core, so it doesn't seem such a bad idea after all.
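
That cleanup could look roughly like this (my assumptions: a plain slot array with a moving logical head, and System.arraycopy as the copy operation timed above):

```java
import java.util.List;

class LazySlotArray<E> {
    private List<E>[] slots; // one slot per second; index 0 = start of the range
    private int head = 0;    // slots before `head` are past and already processed

    @SuppressWarnings("unchecked")
    LazySlotArray(int capacity) {
        slots = (List<E>[]) new List[capacity];
    }

    // Advance one second without shifting anything: O(1) per tick.
    List<E> popDue() {
        List<E> due = slots[head];
        slots[head++] = null;
        return due;
    }

    // Daily cleanup: drop the dead prefix in one bulk copy (the ~31 ms operation).
    @SuppressWarnings("unchecked")
    void compact() {
        List<E>[] fresh = (List<E>[]) new List[slots.length];
        System.arraycopy(slots, head, fresh, 0, slots.length - head);
        slots = fresh;
        head = 0;
    }
}
```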

Second Idea: Chunks

After thinking a little, I thought of this solution: a hash-of-intervals, an interval (i.e. a given time frame) in turn being an array-of-lists-of-events. The intervals are all of equal size, preferably something simple, like days or maybe hours.

For insertion, I look up the right interval through the hash (creating it if none exists), and within the interval, the right list-of-events (again creating it if none exists), and then just insert, which is O(1).

For processing, I simply take the current interval and process the due events by processing the currently due list-of-events and then disposing of it. The array stays of constant length, so we are at O(M) (which is quite the best you can get for processing M elements). Once the current interval is entirely processed (thus, if the interval now represents the "past"), I simply dispose of it at O(1). I can keep an extra reference to the current interval, eliminating the need to look it up, but I guess this doesn't provide any noticeable improvement.
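
Sketched out (with assumptions of mine: one-hour intervals, one slot per second within each interval, and a caller invoking popDue once per second), the chunks idea could look like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkedSchedule<E> {
    private static final int INTERVAL_SECONDS = 3600; // assumed chunk size: 1 hour

    // interval index -> fixed-size array of per-second event lists (created lazily)
    private final Map<Long, List<E>[]> intervals = new HashMap<>();

    @SuppressWarnings("unchecked")
    void schedule(long dueAtSeconds, E event) {
        long interval = dueAtSeconds / INTERVAL_SECONDS;     // hash lookup: O(1)
        int slot = (int) (dueAtSeconds % INTERVAL_SECONDS);  // index into the chunk
        List<E>[] chunk = intervals.computeIfAbsent(
                interval, k -> (List<E>[]) new List[INTERVAL_SECONDS]);
        if (chunk[slot] == null) chunk[slot] = new ArrayList<>();
        chunk[slot].add(event);                              // O(1)
    }

    // Drain the slot for `nowSeconds`; dispose of the chunk once it is all past.
    List<E> popDue(long nowSeconds) {
        long interval = nowSeconds / INTERVAL_SECONDS;
        int slot = (int) (nowSeconds % INTERVAL_SECONDS);
        List<E>[] chunk = intervals.get(interval);
        if (chunk == null) return new ArrayList<>();
        List<E> due = chunk[slot] == null ? new ArrayList<>() : chunk[slot];
        chunk[slot] = null;                                  // O(1) disposal
        if (slot == INTERVAL_SECONDS - 1) intervals.remove(interval); // chunk is past
        return due;
    }
}
```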


It seems to me the second optimization is really the best solution, since it is fast and unbounded. Choosing a good size for the intervals allows trading memory overhead against hash-lookup overhead. I don't know whether I should worry about the hash lookup time at all. For high M, it shouldn't really matter, should it? Thus I'd choose an interval size of 1, which leads me back to approach number 3.

I'd be really grateful for any input on that.

长伴 2024-08-14 06:07:52


If your events have a well-defined upper limit (e.g. no events later than 2 days in the future), you can simply have an array indexed by the number of seconds since the "beginning of time".
The value of the array is a list of events at that offset.

Listing or removal is very efficient - simply find the offset for the time where you wish to list or cut off, and get or re-initialize the arrays pointed to by the indices after that offset.

If your events can stretch out indefinitely into the future, then your own idea of using a hashmap from offsets to lists of events is the best one, with a twist - keep a sorted list (however you wish to implement it) of known offsets; that way you will have very efficient lookups (e.g. you won't have to loop over every key in the map).

You don't need to delete anything from the list of known offsets, so there are no issues with rebalancing - you merely delete from the arrays the hashmap points to.

Also, it seems unclear from your question whether there's any need to know "t" - the time when the event was raised. If you need to know it, store it as part of the event. But the reference to when the event should happen should be absolute in relation to some starting point (if it's a hashmap with unbounded range, you can use epoch seconds, and if events have bounds like in the first array solution I listed, you should instead use "number of seconds since beginning of range" - e.g. from the start of day yesterday).
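
Here's a minimal sketch of the bounded variant (my assumptions: a fixed 2-day horizon at one-second resolution, with the array reused circularly as time moves forward):

```java
import java.util.ArrayList;
import java.util.List;

class BoundedSlots<E> {
    private static final int HORIZON_SECONDS = 2 * 24 * 3600; // assumed 2-day bound

    // One slot per second; indices wrap so the array is reused as time advances.
    private final List<E>[] slots;

    @SuppressWarnings("unchecked")
    BoundedSlots() {
        slots = (List<E>[]) new List[HORIZON_SECONDS];
    }

    void schedule(long dueAtSeconds, long nowSeconds, E event) {
        if (dueAtSeconds < nowSeconds || dueAtSeconds - nowSeconds >= HORIZON_SECONDS)
            throw new IllegalArgumentException("event is outside the fixed horizon");
        int slot = (int) (dueAtSeconds % HORIZON_SECONDS);
        if (slots[slot] == null) slots[slot] = new ArrayList<>();
        slots[slot].add(event); // O(1)
    }

    // Cut off one second: return its events and re-initialize the slot.
    List<E> popDue(long nowSeconds) {
        int slot = (int) (nowSeconds % HORIZON_SECONDS);
        List<E> due = slots[slot] == null ? new ArrayList<>() : slots[slot];
        slots[slot] = null; // slot is free for reuse, HORIZON_SECONDS ahead
        return due;
    }
}
```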
