Scala PriorityQueue on Array[Int] 具有复杂比较函数的性能问题(需要缓存)
该问题涉及 Scala PriorityQueue[Array[Int]] 在大数据集上的性能。需要进行以下操作:入队、出队和过滤。目前,我的实现如下:
对于 Array[Int] 类型的每个元素,都有一个复杂的求值函数:(我不知道如何以更有效的方式编写它,因为它排除了位置 0
def eval_fun(a : Array[Int]) =
if(a.size < 2) 3
else {
var ret = 0
var i = 1
while(i < a.size) {
if((a(i) & 0x3) == 1) ret += 1
else if((a(i) & 0x3) == 3) ret += 3
i += 1
}
ret / a.size
}
)带有比较函数的是基于评估函数:(反向,降序)
val arr_ord = new Ordering[Array[Int]] {
def compare(a : Array[Int], b : Array[Int]) = eval_fun(b) compare eval_fun(a) }
PriorityQueue 定义为:
val pq: scala.collection.mutable.PriorityQueue[Array[Int]] = PriorityQueue()
问题:
- 是否有更优雅、更高效的方式来编写这样的评估函数?我正在考虑使用fold,但是fold不能排除位置0。
- 是否有数据结构可以生成具有唯一元素的优先级队列?在每个入队操作之后应用过滤操作效率不高。
- 有没有缓存方法来减少评估计算?因为当向队列中添加新元素时,每个元素可能需要再次使用 eval_fun 进行评估,如果每个元素的评估值都可以缓存,则不需要这样做。另外,我应该提到两个不同的元素可能具有相同的评估值。
- 是否有更有效的数据结构而不使用泛型类型?因为如果elements的大小达到10000,size的大小达到1000,性能就慢得要命。
感谢您。
The problem involves the Scala PriorityQueue[Array[Int]] performance on large data set. The following operations are needed: enqueue, dequeue, and filter. Currently, my implementation is as follows:
For every element of type Array[Int], there is a complex evaluation function: (I'm not sure how to write it in a more efficient way, because it excludes the position 0)
def eval_fun(a : Array[Int]) =
if(a.size < 2) 3
else {
var ret = 0
var i = 1
while(i < a.size) {
if((a(i) & 0x3) == 1) ret += 1
else if((a(i) & 0x3) == 3) ret += 3
i += 1
}
ret / a.size
}
The ordering with a comparison function is based on the evaluation function: (Reversed, descendent order)
val arr_ord = new Ordering[Array[Int]] {
def compare(a : Array[Int], b : Array[Int]) = eval_fun(b) compare eval_fun(a) }
The PriorityQueue is defined as:
val pq: scala.collection.mutable.PriorityQueue[Array[Int]] = PriorityQueue()
Question:
- Is there a more elegant and efficient way to write such a evaluation function? I'm thinking of using fold, but fold cannot exclude the position 0.
- Is there a data structure to generate a priorityqueue with unique elements? Applying filter operation after each enqueue operation is not efficient.
- Is there a cache method to reduce the evaluation computation? Since when adding a new element to the queue, every element may need to be evaluated by eval_fun again, which is not necessary if evaluated value of every element can be cached. Also, I should mention that two distinct element may have the same evaluated value.
- Is there a more efficient data structure without using generic type? Because if the size of elements reaches 10,000 and the size of size reaches 1,000, the performance is terribly slow.
Thanks you.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
(1) 如果你想在这里获得最大的性能,我会坚持使用 while 循环,即使它不是非常优雅。否则,如果您在数组上使用
view
,则可以轻松地在进入fold
之前删除第一个元素:(2) 您可以维护两个结构,优先级队列,和一套。两者结合起来会给你一个排序集......所以你可以使用
collection.immutable.SortedSet
,但标准库中没有可变变体。是否想要基于优先级函数或实际数组内容进行相等?因为在后一种情况下,您将无法为每次插入逐个元素地比较数组,从而取消缓存优先级函数值的效果。(3) 只需将计算出的优先级与数组一起放入队列即可。 IE
(1) If you want maximum performance here, I would stick to the while loop, even if it is not terribly elegant. Otherwise, if you use a
view
on Array, you can easily drop the first element before going into thefold
:(2) You can maintain two structures, the priority queue, and a set. Both combined give you a sorted-set... So you could use
collection.immutable.SortedSet
, but there is no mutable variant in the standard library. Do want equality based on the priority function, or the actual array contents? Because in the latter case, you won't get around comparing arrays element by element for each insertion, undoing the effect of caching the priority function value.(3) Just put the calculated priority along with the array in the queue. I.e.
好吧,您可以使用尾递归循环(通常这些更“惯用”:
然后您可以使用它来初始化缓存的优先级值:
您可能还注意到我删除了冗余的 & 操作并且只有单个条件(对于当它等于 2 时,而不是对 1 && 3) 进行两次检查 – 这些应该会产生一些最小的影响,
但这与 0__ 刚刚提出的提案没有太大区别。
Well, you can use a tail recursive loop (generally these are more "idiomatic":
Then you can use that to initialise a cached priority value:
You may note also that I removed a redundant & op and have only the single conditional (for when it equals 2, rather than two checks for 1 && 3) – these should have some minimal effect.
There is not a huge difference from 0__'s proposal that just came though.
我的回答:
如果评估很重要,就保持原样。您可能会通过递归获得更好的性能(不知道为什么,但它确实发生了),但是使用几乎任何其他方法肯定会获得更差的性能。
不,没有,但是您只需修改出队操作就可以非常接近它:
defdistinctDequeue[T](q: PriorityQueue[T]): T = {
val 结果 = q.dequeue
while (q.head == 结果) q.dequeue
结果
}
,您必须保留第二个数据结构来跟踪是否已添加元素。无论哪种方式,等号都相当重,但我建议在下一个项目中使其更快。
但请注意,这需要以其他方式解决成本函数上的关系。
像 0__ 建议的那样,将成本放入优先级队列中。但是如果有帮助的话,您也可以在函数上保留缓存。我会尝试这样的事情:
val evalMap = scala.collection.mutable.HashMapWrappedArray[Int], Int
def eval_fun(a : 数组[Int]) =
if(a.size < 2) 3
否则 evalMap.getOrElseUpdate(a, {
变量 ret = 0
变量我 = 1
while(i < a.size) {
if((a(i) & 0x3) == 1) ret += 1
否则 if((a(i) & 0x3) == 3) ret += 3
我 += 1
}
ret / a.尺寸
})
导入scala.math.Ordering.Implicits._
val pq = new collection.mutable.PriorityQueue[(Int, WrappedArray[Int])]
pq += eval_fun(a) ->; (a : WrappedArray[Int])
请注意,我没有创建特殊的
Ordering
——我使用的是标准Ordering
,以便WrappedArray
> 将打破联系。包装Array
的成本很低,并且您可以使用.array
将其返回,但是,另一方面,您将得到以下结果:Ties will通过比较数组本身来打破。如果成本方面没有太多关系,这应该足够了。如果有,请向元组中添加其他内容,以帮助打破平局,而无需比较数组。
这意味着所有相等的元素将被保留在一起,这将使您能够同时将所有它们出列,给人一种只保留一个元素的印象。
并且
equals
实际上会起作用,因为WrappedArray
会像 Scala 序列一样进行比较。我不明白你说的第四点是什么意思。
My answers:
If evaluation is critical, keep it as it is. You might get better performance with recursion (not sure why, but it happens), but you'll certainly get worse performance with pretty much any other approach.
No, there isn't, but you can come pretty close to it just modifying the dequeue operation:
def distinctDequeue[T](q: PriorityQueue[T]): T = {
val result = q.dequeue
while (q.head == result) q.dequeue
result
}
Otherwise, you'd have to keep a second data structure just to keep track of whether an element has been added or not. Either way, that equals sign is pretty heavy, but I have a suggestion to make it faster in the next item.
Note, however, that this requires that ties on the the cost function get solved in some other way.
Like 0__ suggested, put the cost on the priority queue. But you can also keep a cache on the function if that would be helpful. I'd try something like this:
val evalMap = scala.collection.mutable.HashMapWrappedArray[Int], Int
def eval_fun(a : Array[Int]) =
if(a.size < 2) 3
else evalMap.getOrElseUpdate(a, {
var ret = 0
var i = 1
while(i < a.size) {
if((a(i) & 0x3) == 1) ret += 1
else if((a(i) & 0x3) == 3) ret += 3
i += 1
}
ret / a.size
})
import scala.math.Ordering.Implicits._
val pq = new collection.mutable.PriorityQueue[(Int, WrappedArray[Int])]
pq += eval_fun(a) -> (a : WrappedArray[Int])
Note that I did not create a special
Ordering
-- I'm using the standardOrdering
so that theWrappedArray
will break the ties. There's little cost to wrap theArray
, and you get it back with.array
, but, on the other hand, you'll get the following:Ties will be broken by comparing the array themselves. If there aren't many ties in the cost, this should be good enough. If there are, add something else to the tuple to help break ties without comparing the arrays.
That means all equal elements will be kept together, which will enable you to dequeue all of them at the same time, giving the impression of having kept only one.
And that
equals
will actually work, becauseWrappedArray
compare like Scala sequences do.I don't understand what you mean by that fourth point.