Lookup structure for handling future events (time-based)
I am looking for an efficient data structure that'll allow me to queue events ... that is, I will be having an app where, at any time in execution, it is possible that an event will be raised for a future point in execution ... something like:
- t=20: in 420 seconds, A occurs
- t=25: in 13 seconds, B occurs
- t=27: in 735 seconds, C occurs
- ...
so I'd like to have a data structure where I can put in any event at any time in the future, and where I can get and (by doing so) remove all due events ... also, a plus would be if I were able to remove an event from the data structure (because it was cancelled) ... not too important though, since I can simply flag it as cancelled ...
my first thought was maybe to do some sort of tree, but I guess the removing-due-events part would require a lot of rebalancing ...
I am considering simply having an int hash, mapping timestamps to either null or stacks of events that are to occur at that point in time ... I think in scenarios with a lot of events (possibly multiple every second - which is what I intend to work with), this actually isn't such a bad idea after all ...
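A rough sketch of that idea, assuming whole-second timestamps and a plain Runnable standing in for the event type:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class EventHash {
    // due second -> stack of events due at that second
    private final Map<Integer, Deque<Runnable>> buckets = new HashMap<>();

    void schedule(int dueSecond, Runnable event) {
        buckets.computeIfAbsent(dueSecond, k -> new ArrayDeque<>()).push(event);
    }

    // The caller invokes this once for every elapsed second; the bucket is
    // removed from the hash as it is processed.
    void processSecond(int second) {
        Deque<Runnable> due = buckets.remove(second);
        if (due == null) return;
        while (!due.isEmpty()) due.pop().run();
    }
}
```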
so I am eager to hear your input ... :)
edit:
- to be more specific: I think N here is about 100K-1M, and I guess I might have about 1-100 events/second ...
- the t is of no special importance ... it is only to illustrate that a future event can be "enqueued" at any time ...
thanks
back2dos
Comments (4)
I believe you're looking for a Priority Queue, with the timestamp of when the event occurs being the priority (well, lower timestamps would be higher priority).
Just a little elucidation on your use cases:
You'd insert into the priority queue with insertWithPriority, using the timestamp of when the event occurs. This would be O(lg N).
You'd repeatedly call getTop (which gets the event with the lowest timestamp), collecting all elements up to the time of interest. Each call would be O(lg N) due to rebalancing.
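A minimal sketch of this approach, here using java.util.PriorityQueue; TimedEvent and the method names are illustrative, not a fixed API:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// The due timestamp is the priority: the smallest timestamp sits at the head.
record TimedEvent(long dueMillis, Runnable action) {}

class EventQueue {
    private final PriorityQueue<TimedEvent> queue =
            new PriorityQueue<>(Comparator.comparingLong(TimedEvent::dueMillis));

    // insertWithPriority: O(lg N)
    void schedule(TimedEvent e) {
        queue.offer(e);
    }

    // Repeated "getTop" until the head is no longer due: O(lg N) per event.
    void processDue(long nowMillis) {
        while (!queue.isEmpty() && queue.peek().dueMillis() <= nowMillis) {
            queue.poll().action().run();
        }
    }
}
```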
How big is N? How often do you have to insert & remove items, compared to everything else going on? If this is more than 10% of total execution time, and if N is typically more than 100 (say), maybe it's time to be concerned about big-O. I've seen programs with priority queues implemented with fancy container algorithms, allocating iterators, hash maps, heaps, etc., and spending all their time creating and releasing abstract objects, where the median queue length was about three.
ADDED: OK, since N ~ 10^6 and the frequency is ~100 Hz, you probably want some sort of binary tree or heap with O(log N) insertion/removal time. If you're willing to devote 1% of CPU time to this, that is 10^6 microseconds * 1% / 100 = 10^2 microseconds per operation. That should not be difficult at all, because if the typical search depth is 20, at ~50 ns per comparison, a search takes ~1 microsecond. Just make sure to keep it simple, not getting all wrapped up in abstract datatypes. You shouldn't have to worry much about time spent allocating/freeing tree nodes, because you only allocate/free one node per operation. Rebalancing need not be done frequently, say only after every 1000 operations. If you can collect your insertions in batches and then insert them in random order, that may prevent the tree from getting too unbalanced. If many of your events are simultaneous, you could add a small amount of noise to the time code to prevent parts of the tree from becoming more like a linear list.
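One way the noise idea in the last sentence could look, assuming tree keys are built from whole-second timestamps with sub-second headroom (the scale factor is an arbitrary choice):

```java
import java.util.concurrent.ThreadLocalRandom;

final class KeyJitter {
    // Spread events that share one second across distinct keys, so that an
    // unbalanced tree does not degenerate into a linear list at that key.
    // Ordering across different seconds is preserved.
    static long key(long dueSecond) {
        return dueSecond * 1_000_000L + ThreadLocalRandom.current().nextInt(1_000_000);
    }
}
```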
Ok, I'd like to thank you all for your answers - very interesting and helpful. :)
PriorityQueue is definitely the right term I was searching for - thanks for that.
Now it's all about implementation.
Here is what I think:
Let N be the size of the queue and M the average number of events per timestamp ("concurrent" events, so to speak) at the time of processing (the density of events will not be evenly distributed; the "far future" is much more sparse, but as time moves on, that region becomes much denser (actually, I think the maximum density will be somewhere 4 to 12 hours in the future)). I am looking for a scalable solution that performs well for considerably big M. The goal is to really process those M due events within one second, so I want to spend the least time possible finding them.
"In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized) cost per operation."
A quick benchmark showed that the standard implementation for my platform seems to match this.
Actually, I would like to discuss the array approach. O(M+T) is not good. Not at all. But I put some thought into it, and this is what I came up with:
First Idea: Laziness
The O(T) could be crunched down by an arbitrary factor by introducing a bit of laziness, but in the end it'd stay O(T). But how bad is that? Let's say T=2419200, which is 28 days. Then, once a day, I'd clean it up (preferably while low load is expected). That'd waste less than 5% of the array. On my target platform, the copy operation takes 31 ms on a fairly old 2GHz core, so it doesn't seem such a bad idea after all.
Second Idea: Chunks
After thinking a little, I came up with this solution: a hash of intervals, an interval (i.e. a given time frame) in turn being an array of lists of events. The intervals are all of equal size, preferably something simple, like days or maybe hours.
For insertion, I look up the right interval through the hash (creating it if none exists) and, within the interval, the right list of events (again creating it if none exists), and then just insert the event, which is O(1).
For processing, I simply take the current interval and process the due events by working through the currently due list of events and then disposing of it. The array stays of constant length, so we are at O(M) (which is about the best you can get for processing M elements). Once the current interval is entirely processed (i.e. it now represents the "past"), I simply dispose of it at O(1). I can keep an extra reference to the current interval, eliminating the need to look it up, but I guess this doesn't provide any noticeable improvement.
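A minimal sketch of the chunks idea under some assumed parameters (one-hour intervals, one list per second, Runnable as a stand-in event type):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkedSchedule {
    private static final int INTERVAL_SECONDS = 3600; // one interval = one hour

    // interval index -> array of per-second event lists
    private final Map<Long, List<Runnable>[]> intervals = new HashMap<>();

    @SuppressWarnings("unchecked")
    void schedule(long dueSecond, Runnable event) {
        long index = dueSecond / INTERVAL_SECONDS;
        int slot = (int) (dueSecond % INTERVAL_SECONDS);
        List<Runnable>[] interval = intervals.computeIfAbsent(
                index, k -> (List<Runnable>[]) new List[INTERVAL_SECONDS]);
        if (interval[slot] == null) interval[slot] = new ArrayList<>();
        interval[slot].add(event); // O(1)
    }

    void processDue(long nowSecond) {
        long index = nowSecond / INTERVAL_SECONDS;
        int slot = (int) (nowSecond % INTERVAL_SECONDS);
        List<Runnable>[] interval = intervals.get(index);
        if (interval != null && interval[slot] != null) {
            for (Runnable e : interval[slot]) e.run(); // O(M)
            interval[slot] = null;
        }
        // once the last slot is processed, the interval is "past": drop it, O(1)
        if (slot == INTERVAL_SECONDS - 1) intervals.remove(index);
    }
}
```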
It seems to me the second idea is really the best solution, since it is fast and unbounded. Choosing a good size for the intervals allows trading memory overhead against hash lookup overhead. I don't know whether I should worry about the hash lookup time at all. For high M, it shouldn't really matter, should it? Thus I'd choose an interval size of 1, which leads me back to approach number 3.
I'd be really grateful for any input on that.
If your events have a well-defined upper limit (e.g. no events later than 2 days in the future), you can simply have an array indexed by the number of seconds from the "beginning of time".
The value of each array slot is the list of events at that offset.
Listing or removal is very efficient - simply find the offset for the time where you wish to list or cut off, and get or re-initialize the arrays pointed to by the indices after that offset.
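A minimal sketch of this bounded variant, assuming a 2-day horizon, whole-second resolution, and Runnable as a stand-in event type:

```java
import java.util.ArrayList;
import java.util.List;

class BoundedSchedule {
    private static final int HORIZON_SECONDS = 2 * 24 * 3600; // the 2-day upper limit

    private final List<Runnable>[] slots; // one slot per second from the start
    private final long startSecond;       // the chosen "beginning of time"

    @SuppressWarnings("unchecked")
    BoundedSchedule(long startSecond) {
        this.startSecond = startSecond;
        this.slots = (List<Runnable>[]) new List[HORIZON_SECONDS];
    }

    void schedule(long dueSecond, Runnable event) {
        int offset = (int) (dueSecond - startSecond); // must stay below HORIZON_SECONDS
        if (slots[offset] == null) slots[offset] = new ArrayList<>();
        slots[offset].add(event);
    }

    // Run everything due at a given second, then re-initialize the slot.
    void processDue(long nowSecond) {
        int offset = (int) (nowSecond - startSecond);
        if (slots[offset] == null) return;
        for (Runnable e : slots[offset]) e.run();
        slots[offset] = null;
    }
}
```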
If your events can stretch out indefinitely into the future, then your own idea of using a hashmap from offsets to lists of events is the best one, with a twist - keep a sorted list (however you wish to implement it) of the known offsets, so that you get very efficient lookups (e.g. you won't have to loop over every key in the map).
You don't need to delete anything from the list of known offsets, so there are no issues with rebalancing - you merely delete from the arrays the hashmap points to.
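A minimal sketch of the unbounded variant, here keeping the sorted offsets in a TreeSet; names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

class UnboundedSchedule {
    private final Map<Long, List<Runnable>> events = new HashMap<>();
    private final TreeSet<Long> knownOffsets = new TreeSet<>();

    void schedule(long dueSecond, Runnable event) {
        events.computeIfAbsent(dueSecond, k -> new ArrayList<>()).add(event);
        knownOffsets.add(dueSecond);
    }

    void processDue(long nowSecond) {
        // headSet yields only the offsets <= now, in order: no need to loop
        // over every key in the map.
        NavigableSet<Long> due = knownOffsets.headSet(nowSecond, true);
        for (Long offset : due) {
            List<Runnable> list = events.remove(offset);
            if (list != null) for (Runnable e : list) e.run();
        }
        due.clear(); // cheap tidy-up; as noted above, not strictly required
    }
}
```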
Also, it seems unclear from your question whether there's any need to know "t", the time when the event was raised. If you need to know it, store it as part of the event. But the reference to when the event should happen should be absolute in relation to some starting point (if it's a hashmap with unbounded range, you can use epoch seconds; if events have bounds like in the first array solution I listed, you should instead use the number of seconds since the beginning of the range, e.g. from the start of yesterday).