Data structure / algorithm for a query: filter by A, order by B, return the first N results

Posted 2024-12-12 02:26:14

Imagine that you have a large set of m objects with properties A and B. What data structure can you use as an index (or which algorithm) to improve the performance of the following query?

find all objects where A between X and Y, order by B, return first N results;

That is, filter by range A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I'm not happy with the following options:

  1. With records (or index) sorted by B: Scan the records/index in B order, return the first N where A matches X-Y. In the worst cases (few objects match the range X-Y, or the matches are at the end of the records/index) this becomes O(m), which for large data sets of size m is not good enough.

  2. With records (or index) sorted by A: Do a binary search until the first object is found which matches the range X-Y. Scan and create an array of references to all k objects which match the range. Sort the array by B, return the first N. That's O(log m + k + k log k). If k is small then that's really O(log m), but if k is large then the cost of the sort becomes worse than the cost of a linear scan over all m objects.

  3. Adaptive 2/1: do a binary search for the first match of the range X-Y (using an index over A); do a binary search for the last match of the range. If the range is small, continue with algorithm 2; otherwise revert to algorithm 1. The problem here is the case where we revert to algorithm 1. Although we checked that "many" objects pass the filter, which is the good case for algorithm 1, this "many" is at most a constant (asymptotically the O(m) scan will always win over the O(k log k) sort). So we still have an O(m) algorithm for some queries. (A sketch of this adaptive approach follows below.)
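To make the trade-off concrete, here is a small Python sketch of option 3 (my own illustration, not code from the question); the function name and the rough "k log k vs. m" threshold are illustrative assumptions only:

    import bisect

    def query_adaptive(items_by_a, items_by_b, x, y, n_results):
        """Option 3: count the matches k via binary search on an A-sorted
        copy, then either sort just the k matches (option 2) or scan a
        B-sorted copy (option 1).

        items_by_a: list of (A, B) tuples sorted by A
        items_by_b: the same tuples sorted by B
        """
        a_keys = [a for a, _ in items_by_a]          # precompute this in practice
        lo = bisect.bisect_left(a_keys, x)
        hi = bisect.bisect_right(a_keys, y)
        k = hi - lo                                  # number of objects with A in [x, y]
        m = len(items_by_a)

        if k * max(k.bit_length(), 1) <= m:          # rough "k log k vs. m" test
            # Option 2: sort only the k matches by B, keep the first n_results.
            return sorted(items_by_a[lo:hi], key=lambda ab: ab[1])[:n_results]

        # Option 1: walk the B-sorted index and keep the first n_results that
        # pass the A-filter; worst case this is O(m).
        out = []
        for a, b in items_by_b:
            if x <= a <= y:
                out.append((a, b))
                if len(out) == n_results:
                    break
        return out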

Is there an algorithm / data structure which allows answering this query in sublinear time?

If not, what could be good compromises to achieve the necessary performance? For instance, if I don't guarantee returning the objects best ranked by their B property (recall < 1.0), then I can scan only a fraction of the B index. But could I do that while somehow bounding the quality of the results?

Comments (6)

萌吟 2024-12-19 02:26:14

The question you are asking is essentially a more general version of:

Q. You have a sorted list of words with a weight associated with each word, and you want all words which share a prefix with a given query q, and you want this list sorted by the associated weight.

Am I right?

If so, you might want to check this paper which discusses how to do this in O(k log n) time, where k is the number of elements in the output set desired and n is the number of records in the original input set. We assume that k > log n.

http://dhruvbird.com/autocomplete.pdf

(I am the author).

Update: A further refinement I can add is that the question you are asking is related to 2-dimensional range searching, where you want everything in a given X-range and then the top K of that set, ordered by the Y-value.

2D range search lets you find everything in an X/Y-range (if both your ranges are known). In this case, you only know the X-range, so you would need to run the query repeatedly and binary search on the Y-range till you get K results. Each query can be performed in O(log n) time if you employ fractional cascading, and O(log^2 n) with the naive approach. Either of them is sub-linear, so you should be okay.

Additionally, the time to list all entries would add an additional O(k) factor to your running time.
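For illustration, here is a rough Python sketch of the naive variant described above (no fractional cascading), assuming numeric A and B: a merge-sort tree over the A-sorted items answers "how many items in this A-range have B <= t" in O(log^2 n) per probe, and we binary search on the B threshold until at least K items qualify. The class and method names are my own, not from the paper:

    import bisect

    class MergeSortTree:
        """Segment tree over the A-sorted items; each node stores the (B, A)
        pairs of its range sorted by B (a merge-sort tree, i.e. the naive
        O(log^2 n)-per-count variant without fractional cascading)."""

        def __init__(self, items):                   # items: iterable of (A, B)
            self.items = sorted(items)               # sorted by A
            self.a_keys = [a for a, _ in self.items]
            self.n = len(self.items)
            self.nodes = [[] for _ in range(4 * self.n)]
            if self.n:
                self._build(1, 0, self.n - 1)

        def _build(self, v, lo, hi):
            if lo == hi:
                a, b = self.items[lo]
                self.nodes[v] = [(b, a)]
                return
            mid = (lo + hi) // 2
            self._build(2 * v, lo, mid)
            self._build(2 * v + 1, mid + 1, hi)
            self.nodes[v] = sorted(self.nodes[2 * v] + self.nodes[2 * v + 1])

        def _count(self, v, lo, hi, ql, qr, b_cut):
            """Number of items with A-index in [ql, qr] and B <= b_cut."""
            if qr < lo or hi < ql:
                return 0
            if ql <= lo and hi <= qr:
                return bisect.bisect_right(self.nodes[v], (b_cut, float('inf')))
            mid = (lo + hi) // 2
            return (self._count(2 * v, lo, mid, ql, qr, b_cut) +
                    self._count(2 * v + 1, mid + 1, hi, ql, qr, b_cut))

        def _report(self, v, lo, hi, ql, qr, b_cut, out):
            """Collect the (B, A) pairs with A-index in [ql, qr] and B <= b_cut."""
            if qr < lo or hi < ql:
                return
            if ql <= lo and hi <= qr:
                pos = bisect.bisect_right(self.nodes[v], (b_cut, float('inf')))
                out.extend(self.nodes[v][:pos])
                return
            mid = (lo + hi) // 2
            self._report(2 * v, lo, mid, ql, qr, b_cut, out)
            self._report(2 * v + 1, mid + 1, hi, ql, qr, b_cut, out)

        def top_k(self, x, y, k):
            """The k items with A in [x, y] having the smallest B, sorted by B."""
            ql = bisect.bisect_left(self.a_keys, x)
            qr = bisect.bisect_right(self.a_keys, y) - 1
            if ql > qr:
                return []
            # Binary search over candidate B thresholds until >= k items qualify.
            b_sorted = sorted(b for _, b in self.items)
            lo_i, hi_i, cut = 0, len(b_sorted) - 1, b_sorted[-1]
            while lo_i <= hi_i:
                mid = (lo_i + hi_i) // 2
                if self._count(1, 0, self.n - 1, ql, qr, b_sorted[mid]) >= k:
                    cut, hi_i = b_sorted[mid], mid - 1
                else:
                    lo_i = mid + 1
            out = []
            self._report(1, 0, self.n - 1, ql, qr, cut, out)
            out.sort()
            return [(a, b) for b, a in out[:k]]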

微暖i 2024-12-19 02:26:14

Assuming N << k < n, it can be done in O(log n + k + N log N), similar to what you suggested in option 2, but it saves some time: you don't need to sort all k elements, only N of them, which is much smaller!

The database is sorted by A.

    (1) Binary-search the A-sorted data for the first and the last element
        matching [X, Y], and build a list of the k elements between them.
    (2) Find the element with the N'th-best B value using a selection
        algorithm (*), then, in a second pass over the k elements, populate
        a new list of size N with the N best-ranked elements.
    (3) Sort that list of N elements by B.

(*) Selection algorithm: find the N'th-ranked element. It is O(n) in general, or O(k) here, because the list's size is k.

Complexity:

Step 1 is trivially O(log n + k).

Step 2 is O(k) [the selection], and the second pass is also O(k), since the list has only k elements.

Step 3 is O(N log N), a simple sort of a list that contains only N elements.
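A compact Python sketch of the same idea (my own, with heapq.nsmallest standing in for the O(k) selection step; nsmallest is O(k log N), which is still far cheaper than sorting all k candidates):

    import bisect
    import heapq

    def first_n_by_b(items_by_a, x, y, n):
        """items_by_a: list of (A, B) tuples sorted by A."""
        a_keys = [a for a, _ in items_by_a]
        lo = bisect.bisect_left(a_keys, x)       # first index with A >= x
        hi = bisect.bisect_right(a_keys, y)      # one past the last index with A <= y
        candidates = items_by_a[lo:hi]           # the k matches, O(log m + k)
        # Keep only the n best by B; the result comes back already sorted by B.
        return heapq.nsmallest(n, candidates, key=lambda ab: ab[1])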

逆光下的微笑 2024-12-19 02:26:14

If the number of items you want to return is small--up to about 1% of the total number of items--then a simple heap selection algorithm works well. See When theory meets practice. But it's not sub-linear.

For expected sub-linear performance, you can sort the items by A. When queried, use binary search to find the first item where A >= X, and then sequentially scan items until A > Y, using the heap selection technique I outlined in that blog post.

This should give you O(log n) for the initial search, and then O(m log k), where m is the number of items where X <= A <= Y, and k is the number of items you want returned. Yes, it will still be O(n log k) for some queries. The deciding factor will be the size of m.
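A minimal Python sketch of that scan-with-bounded-heap approach (my own illustration; it assumes the items are pre-sorted by A and that "first N" means the N smallest B values):

    import bisect
    import heapq

    def heap_select(items_by_a, x, y, n):
        """Binary search for the first A >= x, scan while A <= y, and keep
        the n smallest-B items seen so far in a bounded heap (keys negated
        because heapq is a min-heap)."""
        a_keys = [a for a, _ in items_by_a]
        start = bisect.bisect_left(a_keys, x)
        heap = []                                   # entries are (-B, A)
        for a, b in items_by_a[start:]:
            if a > y:
                break
            if len(heap) < n:
                heapq.heappush(heap, (-b, a))
            elif -b > heap[0][0]:                   # b beats the current worst kept item
                heapq.heapreplace(heap, (-b, a))
        return sorted(((a, -neg_b) for neg_b, a in heap), key=lambda ab: ab[1])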

独自←快乐 2024-12-19 02:26:14

Set up a segment tree on A and, for each segment, precompute the top N in range. To query, break the input range into O(log m) segments and merge the precomputed results. Query time is O(N log log m + log m); space is O(m log N).
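A sketch of how this could look in Python (my own reading of the answer, assuming N is fixed at build time and "top N" means the N smallest B values):

    import bisect
    import heapq
    import itertools

    class TopNSegmentTree:
        """Items sorted by A; a segment tree over that order where each node
        precomputes the N smallest B-values in its range (O(m log N) space).
        A query decomposes [x, y] into O(log m) nodes and merges their
        precomputed lists."""

        def __init__(self, items, n_top):            # items: iterable of (A, B)
            self.items = sorted(items)               # sorted by A
            self.a_keys = [a for a, _ in self.items]
            self.n_top = n_top
            self.m = len(self.items)
            self.nodes = [[] for _ in range(4 * self.m)]
            if self.m:
                self._build(1, 0, self.m - 1)

        def _build(self, v, lo, hi):
            if lo == hi:
                a, b = self.items[lo]
                self.nodes[v] = [(b, a)]
                return
            mid = (lo + hi) // 2
            self._build(2 * v, lo, mid)
            self._build(2 * v + 1, mid + 1, hi)
            # Keep only the N smallest B per node.
            merged = heapq.merge(self.nodes[2 * v], self.nodes[2 * v + 1])
            self.nodes[v] = list(itertools.islice(merged, self.n_top))

        def _collect(self, v, lo, hi, ql, qr, parts):
            if qr < lo or hi < ql:
                return
            if ql <= lo and hi <= qr:
                parts.append(self.nodes[v])
                return
            mid = (lo + hi) // 2
            self._collect(2 * v, lo, mid, ql, qr, parts)
            self._collect(2 * v + 1, mid + 1, hi, ql, qr, parts)

        def query(self, x, y):
            """Top-N (smallest B) among items with A in [x, y], sorted by B."""
            ql = bisect.bisect_left(self.a_keys, x)
            qr = bisect.bisect_right(self.a_keys, y) - 1
            if ql > qr:
                return []
            parts = []
            self._collect(1, 0, self.m - 1, ql, qr, parts)
            # Merge the O(log m) precomputed sorted lists, keep the first N.
            top = itertools.islice(heapq.merge(*parts), self.n_top)
            return [(a, b) for b, a in top]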

神回复 2024-12-19 02:26:14

This is not really a fully fleshed out solution, just an idea. How about building a quadtree on the A and B axes? You would walk down the tree in, say, a breadth-first manner; then:

  • whenever you find a subtree with A-values all outside the given range [X, Y], you discard that subtree (and don't recurse);
  • whenever you find a subtree with A-values all inside the given range [X, Y], you add that subtree to a set S that you're building and don't recurse;
  • whenever you find a subtree with some A-values inside the range [X, Y] and some outside, you recurse into it.

Now you have the set S of all maximal subtrees with A-coordinates between X and Y; there are at most O(sqrt(m)) of these subtrees, which I will show below.

Some of these subtrees will contain O(m) entries (certainly they will contain O(m) entries all added together), so we can't do anything on all entries of all subtrees. We can now make a heap of the subtrees in S, so that the B-minimum of each subtree is less than the B-minimums of its children in the heap. Now extract B-minimal elements from the top node of the heap until you have N of them; whenever you extract an element from a subtree with k elements, you need to decompose that subtree into O(log(k)) subtrees not containing the recently extracted element.

Now let's consider complexity. Finding the O(sqrt(m)) subtrees will take at most O(sqrt(m)) steps (exercise for the reader, using arguments in the proof below). We should probably insert them into the heap as we find them; this will take O(sqrt(m) * log(sqrt(m))) = O(sqrt(m) * log(m)) steps. Extracting a single element from a k-element subtree in the heap takes O(sqrt(k)) time to find the element, then inserting the O(log(sqrt(k))) = O(log(k)) subtrees back into the heap of size O(sqrt(m)) takes O(log(k) * log(sqrt(m))) = O(log(k) * log(m)) steps. We can probably be smarter using potentials, but we can at least bound k by m, so that leaves N*(O(sqrt(k) + log(k)*log(m))) = O(N * (sqrt(m) + log(m)^2)) = O(N*sqrt(m)) steps for the extraction, and O(sqrt(m)*(N + log(m))) steps in total... which is sublinear in m.
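As a partial Python sketch of this idea (my own, covering only the collection of the maximal in-range subtrees; the heap-of-subtrees extraction phase is omitted), using the alternating median-split binary tree described in the proof below:

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class Node:
        points: List[Tuple[float, float]]      # (A, B) points in this subtree;
                                               # kept at every node only to keep
                                               # the sketch simple
        a_min: float
        a_max: float
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    def build(points, depth=0):
        """Binary tree that splits on the median A-coordinate at even depths
        and the median B-coordinate at odd depths."""
        if not points:
            return None
        axis = depth % 2
        pts = sorted(points, key=lambda p: p[axis])
        node = Node(points=pts,
                    a_min=min(p[0] for p in pts),
                    a_max=max(p[0] for p in pts))
        if len(pts) > 1:
            mid = len(pts) // 2
            node.left = build(pts[:mid], depth + 1)
            node.right = build(pts[mid:], depth + 1)
        return node

    def maximal_subtrees(node, x, y, out):
        """Collect the maximal subtrees whose A-range lies entirely inside
        [x, y]; the proof below argues there are O(sqrt(m)) of them."""
        if node is None or node.a_max < x or node.a_min > y:
            return                                  # entirely outside: discard
        if x <= node.a_min and node.a_max <= y:
            out.append(node)                        # entirely inside: keep, stop
            return
        maximal_subtrees(node.left, x, y, out)      # straddles [x, y]: recurse
        maximal_subtrees(node.right, x, y, out)

    # Usage: root = build(points); S = []; maximal_subtrees(root, X, Y, S)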


Here's a proof of the bound of O(sqrt(m)) subtrees. There are several strategies for building a quadtree, but for ease of analysis, let's say that we make a binary tree; in the root node, we split the data set according to A-coordinate around the point with median A-coordinate, then one level down we split the data set according to B-coordinate around the point with median B-coordinate (that is, median for the half of the points contained in that half-tree), and continue alternating the direction per level.

The height of the tree is log(m). Now let's consider for how many subtrees we need to recurse. We only need to recurse if a subtree contains the A-coordinate X, or it contains the A-coordinate Y, or both. At the (2*k)th level down, there are 2^(2*k) subtrees in total. By then, each subtree has its A-range subdivided k times already, and every time we do that, only half the trees contain the A-coordinate X. So at most 2^k subtrees contain the A-coordinate X. Similarly, at most 2^k subtrees will contain the A-coordinate Y. This means that in total we will recurse into at most 2*sum(2^k, k = 0 .. log(m)/2) = 2*(2^(log(m)/2 + 1) - 1) = O(sqrt(m)) subtrees.

Since we examine at most 2^k subtrees at the (2*k)'th level down, we can also add at most 2^k subtrees at that level to S. This gives the final result.

好倦 2024-12-19 02:26:14

The outcome you describe is what most search engines are built to achieve (sorting, filtering, paging). If you haven't done so already, check out a search engine like Norch or Solr.
