获取 Scala Iterable 的前 n 个元素的最简单方法
是否有一个简单有效的解决方案来确定 Scala Iterable 的前 n 个元素?我的意思是类似的东西
iter.toList.sortBy(_.myAttr).take(2)
,但当只有前两个元素感兴趣时,不必对所有元素进行排序。理想情况下,我正在寻找类似的内容
iter.top(2, _.myAttr)
,另请参阅:Solution for the top element using an Ordering: In Scala, how to use Ordering[T] with List .min 或 List.max 并保持代码可读
更新:
谢谢大家的解决方案。最后,我采用了用户未知的原始解决方案,并采用它来使用Iterable
和pimp-my-library模式:
implicit def iterExt[A](iter: Iterable[A]) = new {
def top[B](n: Int, f: A => B)(implicit ord: Ordering[B]): List[A] = {
def updateSofar (sofar: List [A], el: A): List [A] = {
//println (el + " - " + sofar)
if (ord.compare(f(el), f(sofar.head)) > 0)
(el :: sofar.tail).sortBy (f)
else sofar
}
val (sofar, rest) = iter.splitAt(n)
(sofar.toList.sortBy (f) /: rest) (updateSofar (_, _)).reverse
}
}
case class A(s: String, i: Int)
val li = List (4, 3, 6, 7, 1, 2, 9, 5).map(i => A(i.toString(), i))
println(li.top(3, _.i))
Is there a simple and efficient solution to determine the top n elements of a Scala Iterable? I mean something like
iter.toList.sortBy(_.myAttr).take(2)
but without having to sort all elements when only the top 2 are of interest. Ideally I'm looking for something like
iter.top(2, _.myAttr)
see also: Solution for the top element using an Ordering: In Scala, how to use Ordering[T] with List.min or List.max and keep code readable
Update:
Thank you all for your solutions. Finally, I took the original solution of user unknown and adopted it to use Iterable
and the pimp-my-library pattern:
implicit def iterExt[A](iter: Iterable[A]) = new {
def top[B](n: Int, f: A => B)(implicit ord: Ordering[B]): List[A] = {
def updateSofar (sofar: List [A], el: A): List [A] = {
//println (el + " - " + sofar)
if (ord.compare(f(el), f(sofar.head)) > 0)
(el :: sofar.tail).sortBy (f)
else sofar
}
val (sofar, rest) = iter.splitAt(n)
(sofar.toList.sortBy (f) /: rest) (updateSofar (_, _)).reverse
}
}
case class A(s: String, i: Int)
val li = List (4, 3, 6, 7, 1, 2, 9, 5).map(i => A(i.toString(), i))
println(li.top(3, _.i))
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
我的解决方案(绑定到 Int,但应该很容易更改为 Ordered(请几分钟):
用法:
更新(抽象):
用Iterable替换List似乎有点难。
正如 Daniel C. Sobral 在评论中指出的那样,topN 中的高
n
可能会导致大量排序工作,因此进行手动插入排序而不是重复对整个列表进行排序可能会很有用top-n 元素的个数:top/bottom 方法和用法如上。对于顶部/底部元素的小组,很少调用排序,一开始会调用几次,然后随着时间的推移越来越少。例如,top (10) 为 10 000 时 70 次,top (10) 为 100 000 时 90 次。
My solution (bound to Int, but should be easily changed to Ordered (a few minutes please):
usage:
update (abstraction):
To replace List with Iterable seems to be a bit harder.
As Daniel C. Sobral pointed out in the comments, a high
n
in topN can lead to much sorting work, so that it could be useful, to do a manual insertion sort instead of repeatedly sorting the whole list of top-n elements:top/bottom method and usage as above. For small groups of top/bottom Elements, the sorting is rarely called, a few times in the beginning, and then less and less often over time. For example, 70 times with top (10) of 10 000, and 90 times with top (10) of 100 000.
这是另一个简单且性能相当好的解决方案。
大 O 是
O(n + k log n)
,其中k <= n
。因此,对于较小的k
和最坏的n log n
来说,性能是线性的。该解决方案还可以优化为内存复杂度为
O(k)
,性能优化为O(n log k)
。这个想法是使用 MinHeap 始终只跟踪前 k 个项目。这是解决方案。Here's another solution that is simple and has pretty good performance.
The Big O is
O(n + k log n)
, wherek <= n
. So the performance is linear for smallk
and at worstn log n
.The solution can also be optimized to be
O(k)
for memory butO(n log k)
for performance. The idea is to use a MinHeap to track only the topk
items at all times. Here's the solution.还有另一个版本:
使用
Set
强制列表具有唯一值:Yet another version:
Using the
Set
force the list to have unique values:您无需对整个集合进行排序即可确定前 N 个元素。但是,我不认为此功能是由原始库提供的,因此您必须自己开发,可能使用 pimp-my-library 模式。
例如,您可以按如下方式获取集合的第 n 个元素:(
这不是很有用;抱歉)
获取前 n 个元素就很简单了:
不幸的是,我在获取此皮条客时遇到了麻烦通过这个隐式发现:
我明白了:
但是,代码实际上工作正常:
并且
You don't need to sort the entire collection in order to determine the top N elements. However, I don't believe that this functionality is supplied by the raw library, so you would have to roll you own, possibly using the pimp-my-library pattern.
For example, you can get the nth element of a collection as follows:
(This is not very functional; sorry)
It's then trivial to get the top
n
elements:Unfortunately I'm having trouble getting this pimp to be discovered via this implicit:
I get this:
However, the code actually works OK:
And
如果目标是不对整个列表进行排序,那么您可以执行类似的操作(当然可以稍微优化一下,以便当数字显然不应该存在时我们不会更改列表):
If the goal is to not sort the whole list then you could do something like this (of course it could be optimized a tad so that we don't change the list when the number clearly shouldn't be there):
我最近在 Rank Apache Jackrabbit 类(不过在 Java 中)。请参阅
take
方法了解其要点。基本思想是快速排序,但一旦找到前 n 个元素就提前终止。I implemented such an ranking algorithm recently in the Rank class of Apache Jackrabbit (in Java though). See the
take
method for the gist of it. The basic idea is to quicksort but terminate prematurely as soon as the topn
elements have been found.这是渐近 O(n) 解。
由于递归,此解决方案将比上面的一些非线性解决方案花费更长的时间,并且可能导致
java.lang.OutOfMemoryError:超出GC开销限制
。但恕我直言,它的可读性和功能性风格稍好一些。仅用于工作面试;)。
更重要的是,该解决方案可以轻松并行化。
Here is asymptotically O(n) solution.
Due to recursion, this solution will take longer than some non-linear solutions above and can cause
java.lang.OutOfMemoryError: GC overhead limit exceeded
.But slightly more readable IMHO and functional style. Just for job interview ;).
What is more important, that this solution can be easily parallelized.
对于
n
的较小值和较大的列表,可以通过选取最大元素n
次来获取前n
元素:这并不对整个列表进行排序并采用
Ordering
。我相信我在partitionMax
中调用的每个方法都是 O(列表大小),并且我只希望最多调用它n
次,因此小n 的整体效率
将与迭代器的大小成正比。您还可以在
n
接近可迭代的大小时添加一个分支,并切换到iter.toList.sortBy(_.myAttr).take(n)
。它不会返回提供的集合类型,但您可以查看 如果有要求,如何将丰富我的库模式应用于 Scala 集合?。
For small values of
n
and large lists, getting the topn
elements can be implemented by picking out the max elementn
times:This does not sort the entire list and takes an
Ordering
. I believe every method I call inpartitionMax
is O(list size) and I only expect to call itn
times at most, so the overall efficiency for smalln
will be proportional to the size of the iterator.You could also add a branch for when
n
gets close to size of the iterable, and switch toiter.toList.sortBy(_.myAttr).take(n)
.It does not return the type of collection provided, but you can look at How do I apply the enrich-my-library pattern to Scala collections? if this is a requirement.
使用
PriorityQueue
的优化解决方案,时间复杂度为O(nlogk)
。在更新中给出的方法中,您每次都会对sofar
列表进行排序,这是不需要的,并且在其下方使用PriorityQueue
进行优化。An optimised solution using
PriorityQueue
with Time Complexity ofO(nlogk)
. In the approach given in the update, you are sorting thesofar
list every time which is not needed and below it is optimised by usingPriorityQueue
.