Fixed-size collection that keeps the top (N) values
My code processes a huge number of values and I'm looking for an efficient structure to keep track of the top (N) values, where N is less than 10, so collecting ALL numbers then sorting the list and taking the first (N) is probably not the most efficient way.
To do that, I'm building a collection of fixed size N that keeps the top (N) values sorted in descending order. The Add(T value) method of the sorted collection adds the value if the collection is not full, or if the value is higher than any of the existing values (in which case the last element is removed).
I was able to implement what I wanted using a doubly linked LinkedList<T>, since it has fast insertion and removal, but I was wondering whether a SortedDictionary<TKey, TValue> or a priority queue would be better?
Thank you.
I would simply use a heap with a limited depth. I do not know whether a library for that already exists, but it should be easy to implement.
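A minimal sketch of that idea, written in Python for brevity rather than the question's C#: a min-heap capped at N entries, so the root is always the smallest of the current top N and is the only element a new value has to beat. The class name TopNHeap and its methods are made up for illustration.

```python
import heapq


class TopNHeap:
    """Keeps the n largest values seen so far in a min-heap of at most n items."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self._heap = []  # min-heap: _heap[0] is the smallest retained value

    def add(self, value):
        if len(self._heap) < self.n:
            # Not full yet: every value is kept.
            heapq.heappush(self._heap, value)
        elif value > self._heap[0]:
            # Full: replace the smallest retained value with the new one.
            heapq.heapreplace(self._heap, value)

    def top(self):
        """Return the retained values in descending order."""
        return sorted(self._heap, reverse=True)
```

Values that do not make the cut cost a single comparison against the root; only accepted values pay the O(log N) heap fix-up.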
The main advantage of using a SortedDictionary or SortedList is that you can skip the sorting logic, because they handle it for you (e.g. you just have to remove the (n + 1)th element every time you add a value).
But on the other hand, adopting that sort of complex structure for 10 elements is like using a nuke to kill a fly...
Maybe the linked list is a good way, and a simple linear scan to insert values in order is not much slower than binary search (we are talking about at most 10 comparisons versus ~3; current CPUs won't even feel the difference).
EDIT:
Fixed arrays can be used to build priority queues with binary heaps; that is probably the right way to implement this.
The performance may really change.
For N < 10 any overly complex data structure will likely drag performance significantly (though perhaps not catastrophically) so I'd use an array to store the items.
Then there are 3 main possibilities on how to arrange the items in the array:
For such a small number, just keep an array. Scan the array, keeping track of the smallest value and its position. If your new number is larger than the smallest one in the set, replace it. You should of course scan for the lowest value only once after you insert a number, then just compare new numbers to that and only take action (replace and rescan) if you have something larger.
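Roughly, that could look like the following sketch (Python for brevity rather than the question's C#; TopNArray and its members are hypothetical names). The buffer is left unsorted and only the position of the smallest value is cached, so a rejected value costs one comparison.

```python
class TopNArray:
    """Unsorted buffer of at most n values that caches where its minimum lives."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self._items = []
        self._min_pos = 0  # index of the smallest retained value

    def _rescan(self):
        # Linear scan for the smallest value; cheap when n is below 10.
        self._min_pos = min(range(len(self._items)),
                            key=self._items.__getitem__)

    def add(self, value):
        if len(self._items) < self.n:
            self._items.append(value)
            self._rescan()
        elif value > self._items[self._min_pos]:
            self._items[self._min_pos] = value  # overwrite the old minimum
            self._rescan()
        # Otherwise the value is too small and nothing changes.

    def top(self):
        """Sort only on demand, when the result is actually read."""
        return sorted(self._items, reverse=True)
```

The rescan runs only when a value is accepted, which should become rare once the buffer holds genuinely large values.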
Unless you have a solid reason to do otherwise, I'd use a priority queue.
There is one trick that can simplify the logic quite a bit. Most people's first idea is to look at each incoming item, and insert it into the collection iff the collection contains fewer items than desired, or the new item is larger than the smallest item currently in the collection.
You can simplify things quite a bit if you leave room for one extra item in the collection. Always insert each incoming item into the collection, and then if the collection is too large, remove the smallest item.
While a priority queue is arguably overkill for only 10 items, it keeps the logic simple, and is efficient both in terms of space and time, so if you ever need N=10000 (or whatever) it'll still work nicely.
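Here is a small sketch of the insert-then-trim trick, in Python for brevity rather than C#. The function name top_n is made up for illustration; heapq is Python's standard binary-heap module, used here as the priority queue.

```python
import heapq


def top_n(values, n):
    """Return the n largest values, largest first."""
    heap = []  # min-heap that may briefly hold n + 1 items
    for v in values:
        heapq.heappush(heap, v)   # always insert, no conditions
        if len(heap) > n:
            heapq.heappop(heap)   # too large: drop the smallest
    return sorted(heap, reverse=True)
```

The loop body has no special case for "not full yet" versus "full", which is the simplification described above. The price is that every incoming value pays for a push, even the ones that are immediately popped again.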
Edit:
If only the first N values are needed and the others are not of any interest, a plain old array will get the work done cheaply.
Keep it sorted and test against the biggest. And only if it needs to be stored, insert it correctly and shift the remaining elements. With small sizes this is a cheap operation, and my guess is it won't be done often.
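One way to read this, sketched in Python rather than C#. It assumes the array is kept in descending order and that the goal is the question's N largest values, so the cheap test is against the last slot, which holds the smallest retained value. SortedTopN is a hypothetical name.

```python
class SortedTopN:
    """Fixed array kept in descending order; holds the n largest values seen."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self._slots = [None] * n  # fixed storage, filled from the left
        self._count = 0           # number of slots currently in use

    def add(self, value):
        slots, n = self._slots, len(self._slots)
        if self._count == n:
            if value <= slots[n - 1]:
                return            # common case: one comparison, no writes
            i = n - 1             # the smallest retained value gets overwritten
        else:
            i = self._count       # still filling: use the next free slot
            self._count += 1
        # Shift smaller elements one slot right until the gap is in place.
        while i > 0 and slots[i - 1] < value:
            slots[i] = slots[i - 1]
            i -= 1
        slots[i] = value

    def values(self):
        return self._slots[:self._count]
```

The shifting loop finds the insertion point and moves elements in the same pass, so no separate search step is needed at this size.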
If you have a fixed size of 10, why not simply use a sorted array of length 10 and binary search? But I am not sure whether, at this size, binary search is a big win over a dumb linear search along the array, given its overhead.
Use binary insertion sort on a raw array, pushing the smallest value off the end. This is routinely the fastest method used to maintain small sorted arrays and, for example, is generally used as a special case for various sorting algorithms (e.g. MergeSort).
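As a rough illustration (Python for brevity rather than C#; the function name insert_top_n is an assumption, not from the answer), a single binary-insertion step on a descending list could look like this. A Python list stands in for the raw array, and the binary search is written out by hand so the descending order is explicit.

```python
def insert_top_n(arr, value, n):
    """Insert value into arr (sorted descending, len(arr) <= n), keeping the n largest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(arr) == n and value <= arr[-1]:
        return  # not larger than the smallest retained value
    # Binary search for the first position whose element is smaller than value.
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] >= value:
            lo = mid + 1
        else:
            hi = mid
    arr.insert(lo, value)  # shifts the tail one slot to the right
    if len(arr) > n:
        arr.pop()          # push the smallest value off the end
```

The search takes about log2(N) comparisons, but the shift is still linear, so for N below 10 the gain over a plain linear insertion is likely small.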