具有查找功能的优先级队列 - 最快的实现
我正在考虑实现一个带有附加要求的优先级队列,一个查找/搜索功能,它将告诉一个项目是否在队列中的任何位置。所以函数将是:insert、del-min 和 find。
我不确定是否应该使用堆或自平衡二叉搜索树。看来 PQ 通常是用堆实现的,但我想知道使用二叉搜索树是否有任何优势,因为我也需要该查找函数。
此外,平均而言,我执行的插入操作将多于删除操作。我也在考虑 d-ary 堆。基本上,每一秒都很重要。
谢谢!
I am looking at implementing a priority queue with an added requirement, a find/search function which will tell whether an item is anywhere within the queue. So the functions will be: insert, del-min and find.
I am unsure whether I should use a Heap or a Self-balancing binary search tree. It appears PQs are usually implemented with a Heap, but I am wondering if there is any advantage in using a binary search tree since I also need that find function.
Furthermore, on average I'll be doing more inserts than deletes. I am also considering a d-ary heap. Basically, every second counts.
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
为什么不能只使用优先级队列和集合?当您将某些内容放入队列时,您将其添加到集合中。当您将其出队时,您将其从集合中删除。这样,设备就会告诉您队列中是否有东西。
Why can't you just use a Priority Queue and a Set? When you enqueue something, you add it to the set. When you dequeue it, you remove it from the set. That way the set will tell you if something is in the queue.
如果您的查找操作相对不频繁(并且您的堆相当小),我只会进行线性搜索。如果相对频繁,或者堆很大,请考虑使用单独的数据结构或对象标志来跟踪堆成员资格(以进行“查找”测试)。外部索引的乐趣在于能够将对象放入任意多个容器中。
如果“查找”实际上是指“查找和修改”(我发现我经常需要从优先级队列中删除内容,而与典型的插入/删除分钟无关),那么以下是我使用过的三种方法
:在相当小的工作集(500-1000)上插入/删除分钟(100k/s连续)和低查找删除率(例如1/s)我对元素进行了线性搜索,然后将其从以标准方式树。
考虑到较高的插入/删除率以及相当频繁的查找删除,我在间接找到已删除的对象后将其标记为“无趣”。实际的释放被推迟到对象正常出队为止。
给定一个小的 std::priority_queue (除了 insert/del-min 之外没有访问方法),只有几个元素和相当不频繁的删除,我只是将整个队列复制到临时 std::vector 并复制修改/所需的部分回到队列中。然后我就哭着睡着了。
If your find operation is relatively infrequent (and your heap fairly small), I'd just do a linear search. If it is relatively frequent, or the heap is enormous, consider tracking heap membership (to do your 'find' test) with a separate data structure or an object flag. The joy of external indexing is being able to put your object in as many containers as you like.
If by 'find' you really mean 'find and modify' (I find I often need to delete things from priority queues independently of the typical insert/del-min), here are three approaches I've used:
Given a high rate of insert/del-min (100k/s continuous) and a low rate of find-delete (say 1/s) over a fairly small working set (500-1000) I did a linear search for the element and then deleted it from the tree in the standard way.
Given a high rate of insert/del-min plus fairly frequent find-deletes I simply marked the deleted objects as "uninteresting" after finding them indirectly. The actual free was deferred until the object was dequeued as normal.
Given a small std::priority_queue (which has no access methods outside of insert/del-min) of only a few elements and fairly infrequent deletions, I just copied the entire queue to a temporary std::vector and copied the modified/desired part back into the queue. Then I cried myself to sleep.
如果您需要多种数据结构的优势,那么您可以在组合中使用它们。
例如,如果您需要优先级队列和二叉搜索树的优势,那么可以对它们执行所需的操作。
如果是
insert
则将元素插入到它们两个中。如果是
find
,那么您可以使用二叉搜索树找到该元素,如果找到,则继续在优先级队列中查找它。如果它是
min
,那么首先将其从优先级队列中删除,现在您知道它是哪个元素,然后您可以将其从二叉搜索树中删除。如果是
del
,那么首先在二叉搜索树中找到它并删除它,然后继续在优先级队列中找到它并从那里删除它。假设二叉树的节点和优先级队列的节点是指向元素的指针。
If you need the benefits of more than one data structure then you can use them in composition.
For example, if you need the benefits of a priority queue and a binary search tree then make your desired actions on both of them.
If it's
insert
then insert the element to both of them.If it's
find
then you can find the element using the binary search tree and if it was found then continue on to find it in the priority queue.If it's
min
then remove it first from the priority queue and now that you know which element it is then you can remove it from the binary search tree.if it's
del
then first find it in the binary search tree and remove it then continue to find it in the priority queue and remove it from there too.It is assumed that the nodes of the binary tree and the nodes of the priority queue are pointers to your elements.
IIRC 在堆上的搜索/查找是
O(n)
而在树上则是O(log(n))
并且其他标准 PQ 操作是相同的。堆只是在经验上通过某些恒定因素更有效,因此如果它是一个大队列,那么树应该更好,如果它很小,则需要测试和分析。从理论上知道什么更快是件好事,但如果这些常数因子很大,那么对于足够小的数据集可能完全无关。
IIRC search/find on a heap is
O(n)
whereas on a tree it isO(log(n))
and the other standard PQ operations are the same.Heaps are only empirically more efficient by some constant factor, so if its a big queue a tree should be better, if its small you need to test and profile. its all good to know in theory whats faster, but if those constant factors are large it may be completely irrelevant for sufficiently small data sets.
具有 min-heap 属性的基数树将提供您需要的属性。这实际上将为您的操作提供恒定的时间复杂度。例如,如果我们查看 这个 Haskell 实现,您提到的所有三个操作的时间复杂度都是O(min(n,W))。其中 n 是元素数量,W 是 int 中的位数(32 或 64)。
Radix trees with a min-heap property will provide the properties you need. This will actually give you constant time complexities for your operations. For example, if we look at this Haskell implementation, all three operations you mention have time complexity O(min(n,W)). Where n is the number of elements, and W is the number of bits in an int (32 or 64).
请检查此代码。我编写了这个程序,这是一个包含您所需功能的优先级队列。
你可以尝试一下。它工作得很好。这里我将升序的最小数添加到最大数。
我使用优先级队列默认函数来通过 switch case 来完成此操作。
C++代码:
Please, check this code. I coded this program and this is a priority queue with your needed functions.
You can try it. It's working perfectly. Here I added ascending order minimum number to maximum number.
I used the priority queue default function to do that with a switch case.
C++ code:
首先介绍一些背景知识。堆通常以数组或向量的形式实现,其中每个节点都有一个索引,其中 0 是最高优先级的节点。
您的堆必须为每个节点存储各种数据:值、回调函数等等。
实际上,不断移动这些节点的成本太高,因此它们往往被分配在堆向量之外,堆向量实际上是指向节点的指针数组。
一般来说,我会让优先级队列的 API 为您提供指向相关节点的指针,并实现一个以该节点的地址作为参数的 delete() 方法。实际上,应用程序要求按优先级(在我的例子中是时间)注册一个事件,并获得一个不透明的“cookie”。如果应用程序想要取消该事件,他们会将该 cookie 传回堆中。
好的,现在回答你的问题!这个节点(cookie)应该保存节点在向量中的索引。这允许 delete() 操作直接转到向量中的该点。然后,您可以交换该节点的堆尾部,并根据需要向上或向下重新堆化。
First some background. The heap is typically implemented as an array or vector, whereby each node has an index, with 0 being the highest-priority node.
Your heap has to store various data per node: the value, a callback function or some such.
In practice moving these nodes around constantly is too expensive, so they tend to be allocated outside the heap vector, which is in fact an array of pointers to nodes.
Generally, I will have the API for the priority queue give you the pointer to the node in question, and implement a delete() method that takes the address of that node as an argument. In effect, the application asks to register an event at a priority (in my case, time), and gets an opaque "cookie." Should the app want to cancel the event, they pass that cookie back in to the heap.
OK, now to answer your question! This node (the cookie) should hold the node's index into the vector. That allows the delete() operation to go directly to that point in the vector. You can then swap the tail of the heap for that node and re-heapify up or down as needed.
将数据存储在您测试过的最快容器中,并使用布隆过滤器来测试容器中是否有某些内容。
我在之前的项目中将布隆过滤器与哈希表配合使用,它在平均包含大约 10k 项的哈希表上将速度提高了 400 倍。
布隆过滤器有一些有趣的属性:
结构以确保该项目确实存在。
Store your data in the fastest container you've tested and use a bloom filter to test if something is in the container.
I mated a bloom filter with a hash table in a previous project and it sped things up 400 times on hash tables with an average of roughly 10k items.
The bloom filter has a few interesting properties:
structure to make sure the item is actually present.