Data structure/algorithm for a query: filter by A, sort by B, return N results
Imagine that you have a large set of m objects with properties A and B. What data structure can you use as an index (or which algorithm) to improve the performance of the following query?
find all objects where A between X and Y, order by B, return first N results;
That is, filter by a range on A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I'm not happy with the following options:
1. With records (or index) sorted by B: scan the records/index in B order and return the first N where A matches X-Y. In the worst cases (few objects match the range X-Y, or the matches are at the end of the records/index) this becomes O(m), which for large data sets of size m is not good enough.
2. With records (or index) sorted by A: do a binary search until the first object is found which matches the range X-Y. Scan and create an array of references to all k objects which match the range. Sort the array by B and return the first N. That's O(log m + k + k log k). If k is small then that's really O(log m), but if k is large then the cost of the sort becomes even worse than the cost of the linear scan over all m objects.
3. Adaptive 2/1: do a binary search for the first match of the range X-Y (using an index over A); do a binary search for the last match of the range. If the range is small, continue with algorithm 2; otherwise revert to algorithm 1. The problem here is the case where we revert to algorithm 1. Although we checked that "many" objects pass the filter, which is the good case for algorithm 1, this "many" is at most a constant (asymptotically the O(m) scan will always win over the O(k log k) sort). So we still have an O(m) algorithm for some queries.
Is there an algorithm / data structure which allows answering this query in sublinear time?
If not, what could be good compromises to achieve the necessary performance? For instance, if I don't guarantee returning the objects that rank best on their B property (recall < 1.0), then I can scan only a fraction of the B index. But could I do that while bounding the quality of the results somehow?
The question you are asking is essentially a more general version of:
Am I right?
If so, you might want to check this paper which discusses how to do this in O(k log n) time, where k is the number of elements in the output set desired and n is the number of records in the original input set. We assume that k > log n.
http://dhruvbird.com/autocomplete.pdf
(I am the author).
Update: A further refinement I can add is that the question you are asking is related to 2-dimensional range searching where you want everything in a given X-range and the top-K from the previous set, sorted by the Y-range.
2D range search lets you find everything in an X/Y-range (if both your ranges are known). In this case, you only know the X-range, so you would need to run the query repeatedly and binary search on the Y-range until you get K results. Each query can be performed in O(log n) time if you employ fractional cascading, and in O(log^2 n) time if you employ the naive approach. Either of them is sub-linear, so you should be okay.
Additionally, the time to list all entries would add an additional O(k) factor to your running time.
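One possible way to get a bound of this kind is sketched below. It is not taken from the linked paper and is not the fractional-cascading approach; it is a swapped-in technique that keeps the objects sorted by A, uses a sparse table to find the position of the minimum B in any slice, and then repeatedly splits the slice around each extracted minimum. The class name `TopNIndex`, the `(a, b)` tuple layout, and ascending order on B are all assumptions made for illustration.

```python
import heapq
from bisect import bisect_left, bisect_right


class TopNIndex:
    """Hypothetical index: objects sorted by A plus a sparse table over B."""

    def __init__(self, objects):
        # objects: iterable of (a, b) pairs (an assumed layout).
        self.items = sorted(objects)
        self.a_keys = [a for a, _ in self.items]
        self.b = [b for _, b in self.items]
        # table[j][i] = index of the minimum B in items[i : i + 2**j]
        self.table = [list(range(len(self.b)))]
        j = 1
        while (1 << j) <= len(self.b):
            prev = self.table[-1]
            half = 1 << (j - 1)
            row = []
            for i in range(len(self.b) - (1 << j) + 1):
                left, right = prev[i], prev[i + half]
                row.append(left if self.b[left] <= self.b[right] else right)
            self.table.append(row)
            j += 1

    def _argmin(self, lo, hi):
        # Index of the minimum B in the non-empty slice items[lo:hi].
        j = (hi - lo).bit_length() - 1
        left = self.table[j][lo]
        right = self.table[j][hi - (1 << j)]
        return left if self.b[left] <= self.b[right] else right

    def query(self, x, y, n):
        # Binary search for the slice of objects with x <= A <= y.
        lo = bisect_left(self.a_keys, x)
        hi = bisect_right(self.a_keys, y)
        heap = []
        if lo < hi:
            i = self._argmin(lo, hi)
            heap.append((self.b[i], i, lo, hi))
        result = []
        while heap and len(result) < n:
            # Pop the slice whose minimum B is smallest, emit that object,
            # then push the two sub-slices on either side of it.
            _, i, s, e = heapq.heappop(heap)
            result.append(self.items[i])
            for sub_lo, sub_hi in ((s, i), (i + 1, e)):
                if sub_lo < sub_hi:
                    k = self._argmin(sub_lo, sub_hi)
                    heapq.heappush(heap, (self.b[k], k, sub_lo, sub_hi))
        return result
```

With this layout, preprocessing costs O(m log m) time and space, and a query does two binary searches plus at most N heap pops, so it never touches the objects in the A-range that are not returned. For example, `TopNIndex([(5, 9), (1, 3), (7, 1), (3, 8), (4, 2), (9, 0), (6, 5)]).query(3, 7, 3)` returns `[(7, 1), (4, 2), (6, 5)]`.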
Assuming N << k < n, it can be done in O(log n + k + N log N). This is similar to what you suggested in option 2 but saves some time: you don't need to sort all k elements, only N, which is much smaller! The database is sorted by A.
Selection algorithm: find the N'th biggest element. It is O(n) in general, or O(k) here, because the list's size is k.
Complexity:
Step 1 is trivially O(log n + k).
Step 2 is O(k) [selection], and another iteration is also O(k), since this list has only k elements.
Step 3 is O(N log N), a simple sort, since the last list contains only N elements.
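The three steps in the complexity breakdown can be sketched as follows. This is only an illustration under assumptions: objects are `(a, b)` tuples already sorted by A, `a_keys` is the parallel list of A values, the query wants the N smallest B values (flip the comparisons for "biggest"), and the selection step is a plain randomized quickselect with expected rather than worst-case O(k) cost. The helper names are made up for this example.

```python
import random
from bisect import bisect_left, bisect_right


def nth_smallest(values, n):
    """n-th smallest value (1-based) via randomized quickselect, expected O(len(values))."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        less = [v for v in values if v < pivot]
        equal_count = sum(1 for v in values if v == pivot)
        if n <= len(less):
            values = less
        elif n <= len(less) + equal_count:
            return pivot
        else:
            n -= len(less) + equal_count
            values = [v for v in values if v > pivot]


def top_n_by_selection(sorted_by_a, a_keys, x, y, n):
    if n <= 0:
        return []
    # Step 1: binary search for the range, O(log n), then take its k objects.
    lo = bisect_left(a_keys, x)
    hi = bisect_right(a_keys, y)
    candidates = sorted_by_a[lo:hi]
    if len(candidates) > n:
        # Step 2: select the n-th smallest B, then one more pass to keep
        # only the n objects at or below that threshold.
        threshold = nth_smallest([b for _, b in candidates], n)
        below = [o for o in candidates if o[1] < threshold]
        ties = [o for o in candidates if o[1] == threshold]
        candidates = below + ties[: n - len(below)]
    # Step 3: sort at most n objects by B, O(N log N).
    return sorted(candidates, key=lambda o: o[1])
```

The point of the design is that the full sort over k elements from option 2 is replaced by a linear-time selection, so the k log k term drops to k; the query is still linear in k, not sublinear.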
If the number of items you want to return is small (up to about 1% of the total number of items), then a simple heap selection algorithm works well. See When theory meets practice. But it's not sub-linear.
For expected sub-linear performance, you can sort the items by A. When queried, use binary search to find the first item where A >= X, and then sequentially scan items until A > Y, using the heap selection technique I outlined in that blog post.
This should give you O(log n) for the initial search, and then O(m log k), where m is the number of items where X <= A <= Y, and k is the number of items you want returned. Yes, it will still be O(n log k) for some queries. The deciding factor will be the size of m.
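A minimal sketch of scanning the A-range with a bounded heap is shown below. The blog post's exact technique is not reproduced here; this is the generic "keep the best k seen so far in a heap" pattern, assuming `(a, b)` tuples sorted by A, a parallel `a_keys` list, numeric B values, and ascending order on B. The function name is hypothetical.

```python
import heapq
from bisect import bisect_left


def top_k_heap_select(sorted_by_a, a_keys, x, y, k):
    if k <= 0:
        return []
    # heapq is a min-heap, so store -B: the root is then the largest B
    # among the k best objects kept so far.
    heap = []
    i = bisect_left(a_keys, x)  # first item with A >= X, O(log n)
    while i < len(sorted_by_a) and a_keys[i] <= y:  # scan until A > Y
        b = sorted_by_a[i][1]
        if len(heap) < k:
            heapq.heappush(heap, (-b, i))
        elif -b > heap[0][0]:
            # This B is smaller than the worst one kept: replace it.
            heapq.heapreplace(heap, (-b, i))
        i += 1
    # Final ordering of the k survivors by B.
    return sorted((sorted_by_a[j] for _, j in heap), key=lambda o: o[1])
```

Each of the m scanned items costs at most one O(log k) heap operation, which is where the O(m log k) term comes from; in practice most items fail the `elif` comparison and cost O(1).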
Set up a segment tree on A and, for each segment, precompute the top N in range. To query, break the input range into O(log m) segments and merge the precomputed results. Query time is O(N log log m + log m); space is O(m log N).
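As a rough sketch of this idea, the class below builds a bottom-up segment tree over the objects sorted by A, where each node stores the (at most) `max_n` smallest-B objects in its range, already sorted by B. The class name, the `(a, b)` tuple layout, ascending order on B, and the fixed `max_n` cap chosen at build time are assumptions for illustration.

```python
import heapq
from bisect import bisect_left, bisect_right
from itertools import islice


def _by_b(obj):
    return obj[1]


class TopNSegmentTree:
    def __init__(self, objects, max_n):
        self.max_n = max_n
        self.items = sorted(objects)  # sorted by A
        self.a_keys = [a for a, _ in self.items]
        self.size = len(self.items)
        # tree[size + i] is leaf i; tree[p] summarizes tree[2p] and tree[2p + 1].
        self.tree = [[] for _ in range(2 * self.size)]
        for i, obj in enumerate(self.items):
            self.tree[self.size + i] = [obj][:max_n]
        for p in range(self.size - 1, 0, -1):
            merged = heapq.merge(self.tree[2 * p], self.tree[2 * p + 1], key=_by_b)
            self.tree[p] = list(islice(merged, max_n))  # keep only the top max_n

    def query(self, x, y, n):
        if n > self.max_n:
            raise ValueError("n exceeds the precomputed top-N size")
        lo = bisect_left(self.a_keys, x) + self.size
        hi = bisect_right(self.a_keys, y) + self.size
        parts = []
        # Standard bottom-up decomposition of [lo, hi) into O(log m) nodes.
        while lo < hi:
            if lo & 1:
                parts.append(self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                parts.append(self.tree[hi])
            lo >>= 1
            hi >>= 1
        # k-way merge of the precomputed lists, stopping after n results.
        return list(islice(heapq.merge(*parts, key=_by_b), n))
```

The query never looks at more than the precomputed lists of O(log m) nodes, and the k-way merge stops after N results, which is how the merge stays close to the stated O(N log log m + log m). The trade-off is that N must be capped when the index is built.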
This is not really a fully fleshed out solution, just an idea. How about building a quadtree on the A and B axes? You would walk down the tree in, say, a breadth-first manner; then:
Now you have the set S of all maximal subtrees with A-coordinates between X and Y; there are at most O(sqrt(m)) of these subtrees, which I will show below.
Some of these subtrees will contain O(m) entries (certainly they will contain O(m) entries all added together), so we can't do anything on all entries of all subtrees. We can now make a heap of the subtrees in S, so that the B-minimum of each subtree is less than the B-minimums of its children in the heap. Now extract B-minimal elements from the top node of the heap until you have N of them; whenever you extract an element from a subtree with k elements, you need to decompose that subtree into O(log(k)) subtrees not containing the recently extracted element.
Now let's consider complexity. Finding the O(sqrt(m)) subtrees will take at most O(sqrt(m)) steps (exercise for the reader, using arguments in the proof below). We should probably insert them into the heap as we find them; this will take O(sqrt(m) * log(sqrt(m))) = O(sqrt(m) * log(m)) steps. Extracting a single element from a k-element subtree in the heap takes O(sqrt(k)) time to find the element; inserting the O(log(sqrt(k))) = O(log(k)) subtrees back into the heap of size O(sqrt(m)) then takes O(log(k) * log(sqrt(m))) = O(log(k) * log(m)) steps. We can probably be smarter using potentials, but we can at least bound k by m, so that leaves N * O(sqrt(k) + log(k)*log(m)) = O(N * (sqrt(m) + log(m)^2)) = O(N*sqrt(m)) steps for the extraction, and O(sqrt(m)*(N + log(m))) steps in total... which is sublinear in m.
Here's a proof of the bound of O(sqrt(m)) subtrees. There are several strategies for building a quadtree, but for ease of analysis, let's say that we make a binary tree; in the root node, we split the data set according to A-coordinate around the point with median A-coordinate, then one level down we split the data set according to B-coordinate around the point with median B-coordinate (that is, median for the half of the points contained in that half-tree), and continue alternating the direction per level.
The height of the tree is log(m). Now let's consider how many subtrees we need to recurse into. We only need to recurse if a subtree contains the A-coordinate X, or it contains the A-coordinate Y, or both. At the (2*k)'th level down, there are 2^(2*k) subtrees in total. By then, each subtree has had its A-range subdivided k times already, and every time we do that, only half the trees contain the A-coordinate X. So at most 2^k subtrees contain the A-coordinate X. Similarly, at most 2^k will contain the A-coordinate Y. This means that in total we will recurse into at most 2*sum(2^k, k = 0 .. log(m)/2) = 2*(2^(log(m)/2 + 1) - 1) = O(sqrt(m)) subtrees.
Since we examine at most 2^k subtrees at the (2*k)'th level down, we can also add at most 2^k subtrees at that level to S. This gives the final result.
The outcome you describe is what most search engines are built to achieve (sorting, filtering, paging). If you haven't done so already, check out a search engine like Norch or Solr.