返回给定间隔内插入的项目
如何设计一个内存高效的系统,它接受添加到其中的项目,并允许在给定的时间间隔内检索项目(即返回在时间 T1 和时间 T2 之间插入的项目)。不涉及数据库。存储在内存中的项目。涉及到的数据结构和相关算法是什么。
更新: 假设与数据查询相比插入率极高。
How would one design a memory efficient system which accepts Items added into it and allows Items to be retrieved given a time interval (i.e. return Items inserted between time T1 and time T2). There is no DB involved. Items stored in-memory. What is the data structure involved and associated algorithm.
Updated:
Assume extremely high insertion rate compared to data query.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您可以使用排序的数据结构,其中键是按到达时间排序的。请注意以下事项:
i
在项目j
之后插入,则key(i)>key(j)
]。因此,不鼓励使用树,因为它是“压倒性的”,并且在其中插入的时间复杂度为
O(logn)
,您可以获得O(1)
插入。我建议使用以下之一:(1)数组:数组将始终在其末尾被填充。当分配的数组已满时,重新分配一个更大的[双倍大小]数组,并将现有数组复制到其中。
优点:在数组中通常需要良好的缓存,< code>O(1) 机动化插入,使用空间最多
2*elementSize*#elemetns
缺点:高延迟:当数组已满,添加一个元素需要
O(n)
,所以你需要预料到偶尔会出现代价高昂的操作。(2)跳过列表 跳过列表还允许您
O(logn)
查找并在末尾插入O(1)
,但它不存在延迟问题。然而,它比数组更容易遭受缓存未命中的影响。跳跃列表的平均使用空间为elementSize*#elements + pointSize*#elements*2
。优点:
O(1)
插入,无需昂贵的操作。缺点:预计缓存会很差。
建议:
如果延迟不是问题,我建议使用数组。如果是,您最好使用跳过列表。
在两者中,找到所需的间隔是:
You can use a sorted data structure, where key is by time of arrival. Note the following:
i
was inserted after itemj
thenkey(i)>key(j)
].For this reason, tree is discouraged, since it is "overpower", and insertion in it is
O(logn)
, where you can get anO(1)
insertion. I suggest using one of the followings:(1)Array: the array will be filled up always at its end. When the allocated array is full, reallocate a bigger [double sized] array, and copy existing array to it.
Advantages: good caching is usually expected in arrays,
O(1)
armotorized insertion, used space is at most2*elementSize*#elemetns
Disadvantages: high latency: when the array is full, it will take
O(n)
to add an element, so you need to expect that once in a while, there will be costly operation.(2)Skip list The skip list also allows you also
O(logn)
seek andO(1)
insertion at the end, but it doesn't have latency issues. However, it will suffer more from cache misses then an array. Space used is on averageelementSize*#elements + pointerSize*#elements*2
for a skip list.Advantages:
O(1)
insertion, no costly ops.Distadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue. If it is, you should better use a skip list.
In both, finding the desired interval is:
BTree 或二叉搜索树都可以是完成上述任务的良好内存数据结构。只需在每个节点中保存时间戳即可进行范围查询。
Either BTree or Binary Search Tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.
您可以将它们全部添加到一个简单的数组中并对它们进行排序。
执行二分搜索以找到 T1 和 T2。它们之间的所有数组元素都是您要查找的内容。
如果仅在添加所有元素后才进行搜索,这会很有帮助。如果没有,您可以使用 AVL 或 红黑树
You can add them all to a simple array and sort them.
Do a binary search to located both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not you can use an AVL or Red-Black tree
关系区间树怎么样(将您的项目编码为包含只有一个元素,例如 [a,a])?尽管如此,已经有人说过,预期运营的比率很重要(实际上很重要)。但我的两点意见是:
我想在时间 t(X) 插入的项目 X 与该时间戳相关联,对吧?这意味着您现在不插入具有一周前或其他时间戳的项目。如果是这种情况,请使用简单数组并进行插值搜索或类似的操作(您的项目将已经根据查询引用的属性(即时间 t(X))进行排序)。
How about a relation interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, it has been said already that the ratio of the anticipated operations matter (a lot actually). But here's my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now which has a timestamp from a week ago or something. If that's the case go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute that your query refers to, i.e., the time t(X)).
我们已经有了一个建议树的答案,但我认为我们需要更具体:这确实是一个好的解决方案的唯一情况是,如果您非常具体地了解如何构建树(然后我会说它是与另一个答案中建议的跳过列表相同)。目标是使树尽可能充满左侧 - 我将在下面更清楚地说明这意味着什么。确保每个节点都有一个指向其(最多)两个子节点的指针和指向其父节点并且知道以该节点为根的子树的深度。
保留一个指向根节点的指针,以便能够在 O(log(n)) 中进行查找,并保留一个指向最后插入的节点 N 的指针(这必然是具有最高键的节点 - 它的时间戳将是最高)。插入节点时,检查 N 有多少个子节点:
附录:为了使缓存行为和内存开销更好,最好的解决方案可能是创建一个数组树或跳过列表。让每个节点都具有一个包含 1024 个时间戳和值的数组,而不是让每个节点都具有单个时间戳和单个值。当数组填满时,您可以在顶级数据结构中添加一个新数组,但在大多数步骤中,您只需将单个元素添加到“当前数组”的末尾即可。这不会影响 big-O 在内存或时间方面的行为,但它会将开销减少 1024 倍,而延迟仍然非常小。
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer; ). The objective is to keep the tree as full as possible to the left - I'll make clearer what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:
Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node having a single time stamp and a single value, make every node have an array of, say, 1024 time stamps and values. When an array fills up you add a new one in the top level data structure, but in most steps you just add a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it would reduce the overhead by a factor of 1024, while latency is still very small.