返回给定间隔内插入的项目

发布于 2024-12-14 05:19:06 字数 139 浏览 3 评论 0原文

如何设计一个内存高效的系统,它接受添加到其中的项目,并允许在给定的时间间隔内检索项目(即返回在时间 T1 和时间 T2 之间插入的项目)。不涉及数据库。存储在内存中的项目。涉及到的数据结构和相关算法是什么。

更新: 假设与数据查询相比插入率极高。

How would one design a memory efficient system which accepts Items added into it and allows Items to be retrieved given a time interval (i.e. return Items inserted between time T1 and time T2). There is no DB involved. Items stored in-memory. What is the data structure involved and associated algorithm.

Updated:
Assume extremely high insertion rate compared to data query.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

◇流星雨 2024-12-21 05:19:06

您可以使用排序的数据结构,其中键是按到达时间排序的。请注意以下事项:

  1. 项目未删除
  2. 项目按顺序插入 [如果项目 i 在项目 j 之后插入,则 key(i)>key(j)]。

因此,不鼓励使用树,因为它是“压倒性的”,并且在其中插入的时间复杂度为 O(logn),您可以获得 O(1) 插入。我建议使用以下之一:

(1)数组:数组将始终在其末尾被填充。当分配的数组已满时,重新分配一个更大的[双倍大小]数组,并将现有数组复制到其中。

优点:在数组中通常需要良好的缓存,< code>O(1) 机动化插入,使用空间最多2*elementSize*#elemetns

缺点:延迟:当数组已满,添加一个元素需要O(n),所以你需要预料到偶尔会出现代价高昂的操作。

(2)跳过列表 跳过列表还允许您O(logn) 查找并在末尾插入 O(1) ,但它不存在延迟问题。然而,它比数组更容易遭受缓存未命中的影响。跳跃列表的平均使用空间为 elementSize*#elements + pointSize*#elements*2

优点O(1)插入,无需昂贵的操作。

缺点:预计缓存会很差。

建议:

如果延迟不是问题,我建议使用数组。如果是,您最好使用跳过列表。

在两者中,找到所需的间隔是:

findInterval(T1,T2):
  start <- data.find(T1)
  end <- data.find(T2)
  for each element in data from T1 to T2:
     yield element

You can use a sorted data structure, where key is by time of arrival. Note the following:

  1. items are not remvoed
  2. items are inserted in order [if item i was inserted after item j then key(i)>key(j)].

For this reason, tree is discouraged, since it is "overpower", and insertion in it is O(logn), where you can get an O(1) insertion. I suggest using one of the followings:

(1)Array: the array will be filled up always at its end. When the allocated array is full, reallocate a bigger [double sized] array, and copy existing array to it.

Advantages: good caching is usually expected in arrays, O(1) armotorized insertion, used space is at most 2*elementSize*#elemetns

Disadvantages: high latency: when the array is full, it will take O(n) to add an element, so you need to expect that once in a while, there will be costly operation.

(2)Skip list The skip list also allows you also O(logn) seek and O(1) insertion at the end, but it doesn't have latency issues. However, it will suffer more from cache misses then an array. Space used is on average elementSize*#elements + pointerSize*#elements*2 for a skip list.

Advantages: O(1) insertion, no costly ops.

Distadvantages: bad caching is expected.

Suggestion:

I suggest using an array if latency is not an issue. If it is, you should better use a skip list.

In both, finding the desired interval is:

findInterval(T1,T2):
  start <- data.find(T1)
  end <- data.find(T2)
  for each element in data from T1 to T2:
     yield element
誰認得朕 2024-12-21 05:19:06

BTree 或二叉搜索树都可以是完成上述任务的良好内存数据结构。只需在每个节点中保存时间戳即可进行范围查询。

Either BTree or Binary Search Tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.

静若繁花 2024-12-21 05:19:06

您可以将它们全部添加到一个简单的数组中并对它们进行排序。

执行二分搜索以找到 T1T2。它们之间的所有数组元素都是您要查找的内容。

如果仅在添加所有元素后才进行搜索,这会很有帮助。如果没有,您可以使用 AVL红黑

You can add them all to a simple array and sort them.

Do a binary search to located both T1 and T2. All the array elements between them are what you are looking for.

This is helpful if the searching is done only after all the elements are added. If not you can use an AVL or Red-Black tree

不羁少年 2024-12-21 05:19:06

关系区间树怎么样(将您的项目编码为包含只有一个元素,例如 [a,a])?尽管如此,已经有人说过,预期运营的比率很重要(实际上很重要)。但我的两点意见是:

我想在时间 t(X) 插入的项目 X 与该时间戳相关联,对吧?这意味着您现在不插入具有一周前或其他时间戳的项目。如果是这种情况,请使用简单数组并进行插值搜索或类似的操作(您的项目将已经根据查询引用的属性(即时间 t(X))进行排序)。

How about a relation interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, it has been said already that the ratio of the anticipated operations matter (a lot actually). But here's my two cents:

I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now which has a timestamp from a week ago or something. If that's the case go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute that your query refers to, i.e., the time t(X)).

孤独患者 2024-12-21 05:19:06

我们已经有了一个建议树的答案,但我认为我们需要更具体:这确实是一个好的解决方案的唯一情况是,如果您非常具体地了解如何构建树(然后我会说它是与另一个答案中建议的跳过列表相同)。目标是使树尽可能充满左侧 - 我将在下面更清楚地说明这意味着什么。确保每个节点都有一个指向其(最多)两个子节点的指针指向其父节点并且知道以该节点为根的子树的深度。

保留一个指向根节点的指针,以便能够在 O(log(n)) 中进行查找,并保留一个指向最后插入的节点 N 的指针(这必然是具有最高键的节点 - 它的时间戳将是最高)。插入节点时,检查 N 有多少个子节点:

  • 如果为 0,则将 N 替换为要插入的新节点,并使 N 成为其左子节点。 (此时,您需要更新最多 O(log(n)) 个节点的树深度字段。)
  • 如果为 1,则将新节点添加为其右子节点。
  • 如果是 2,那么事情就会变得有趣。从 N 开始沿着树向上查找,直到找到只有 1 个子节点的节点或根。如果您发现一个节点只有 1 个子节点(这必然是左子节点),则将新节点添加为其新的右子节点。如果直到根节点的所有节点都有两个子节点,则当前树已满。将新节点添加为新根节点,将旧根节点添加为其左子节点。否则不要更改旧的树结构。

附录:为了使缓存行为和内存开销更好,最好的解决方案可能是创建一个数组树或跳过列表。让每个节点都具有一个包含 1024 个时间戳和值的数组,而不是让每个节点都具有单个时间戳和单个值。当数组填满时,您可以在顶级数据结构中添加一个新数组,但在大多数步骤中,您只需将单个元素添加到“当前数组”的末尾即可。这不会影响 big-O 在内存或时间方面的行为,但它会将开销减少 1024 倍,而延迟仍然非常小。

We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer; ). The objective is to keep the tree as full as possible to the left - I'll make clearer what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent and knows the depth of the subtree rooted at that node.

Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:

  • If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
  • If 1, then add the new node as its right child.
  • If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.

Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node having a single time stamp and a single value, make every node have an array of, say, 1024 time stamps and values. When an array fills up you add a new one in the top level data structure, but in most steps you just add a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it would reduce the overhead by a factor of 1024, while latency is still very small.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文