从集合中选择随机元素,比线性时间更快(Haskell)
我想创建这个函数,它从 Set:
randElem :: (RandomGen g) => Set a -> g -> (a, g)
可以编写简单的listy实现。例如(代码更新,验证工作):
import Data.Set as Set
import System.Random (getStdGen, randomR, RandomGen)
randElem :: (RandomGen g) => Set a -> g -> (a, g)
randElem s g = (Set.toList s !! n, g')
where (n, g') = randomR (0, Set.size s - 1) g
-- simple test drive
main = do g <- getStdGen
print . fst $ randElem s g
where s = Set.fromList [1,3,5,7,9]
但是使用 !!
会导致大型(随机选择)n
的线性查找成本。有没有更快的方法来选择集合中的随机元素?理想情况下,重复随机选择应该在所有选项上产生均匀分布,这意味着它不会比其他元素更喜欢某些元素。
编辑:答案中出现了一些很棒的想法,所以我只想对我到底在寻找什么进行更多说明。我用 Sets 提出了这个问题,作为这种情况的解决方案。我更喜欢这样的答案:既
- 避免使用超出集合内部的任何函数外簿记,又
- 保持良好的性能(平均优于 O(n)),即使该函数仅使用一次每个独特的集合。
我也热爱工作代码,所以如果您的答案包含一个工作解决方案,请期待(至少)我+1。
I'd like to create this function, which selects a random element from a Set:
randElem :: (RandomGen g) => Set a -> g -> (a, g)
Simple listy implementations can be written. For example (code updated, verified working):
import Data.Set as Set
import System.Random (getStdGen, randomR, RandomGen)
randElem :: (RandomGen g) => Set a -> g -> (a, g)
randElem s g = (Set.toList s !! n, g')
where (n, g') = randomR (0, Set.size s - 1) g
-- simple test drive
main = do g <- getStdGen
print . fst $ randElem s g
where s = Set.fromList [1,3,5,7,9]
But using !!
incurs a linear lookup cost for large (randomly selected) n
. Is there a faster way to select a random element in a Set? Ideally, repeated random selections should produce a uniform distribution over all options, meaning it does not prefer some elements over others.
Edit: some great ideas are popping up in the answers, so I just wanted to throw a couple more clarifications on what exactly I'm looking for. I asked this question with Sets as the solution to this situation in mind. I'll prefer answers that both
- avoid using any outside-the-function bookkeeping beyond the Set's internals, and
- maintain good performance (better than O(n) on average) even though the function is only used once per unique set.
I also have this love of working code, so expect (at minimum) a +1 from me if your answer includes a working solution.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
Data.Map 有一个索引函数(elemAt),所以使用这个:
并且你有一个与 Data.Set 非常兼容的东西(在接口和性能方面),它也有一个 log(n) 索引函数和你请求的 randElem 函数。
请注意,randElem 是 log(n)(它可能是在这种复杂性下可以获得的最快实现),并且所有其他函数都具有与 Data.Set 中相同的复杂性。如果您需要 Set API 中的任何其他特定功能,请告诉我,我将添加它们。
Data.Map has an indexing function (elemAt), so use this:
And you have something quite compatible with Data.Set (with respect to interface and performance) that also has a log(n) indexing function and the randElem function you requested.
Note that randElem is log(n) (and it's probably the fastest implementation you can get with this complexity), and all the other functions have the same complexity as in Data.Set. Let me know if you need any other specific functions from the Set API and I will add them.
据我所知,正确的解决方案是使用索引集——即
IntMap
。您只需要存储随地图一起添加的元素总数。每次添加一个元素时,都会使用比之前高一级的键来添加它。删除一个元素很好——只是不要改变元素总数计数器。如果在查找键控元素时该元素不再存在,则生成一个新的随机数并重试。直到删除总数超过集合中活动元素的数量为止。如果这是一个问题,您可以保留一组单独的已删除键,以便在插入新元素时进行绘制。As far as I know, the proper solution would be to use an indexed set -- i.e. an
IntMap
. You just need to store the total number of elements added along with the map. Every time you add an element, you add it with a key one higher than previously. Deleting an element is fine -- just don't alter the total elements counter. If, on looking up a keyed element, that element no longer exists, then generate a new random number and try again. This works until the total number of deletions dominates the number of active elements in the set. If that's a problem, you can keep a separate set of deleted keys to draw from when inserting new elements.这是一个想法:你可以进行区间二分。
size s
是常数时间。使用randomR
获取您选择的集合的深度。findMin
和findMax
之间使用不同的值进行split
,直到在您想要的位置获得元素。如果您确实担心该集合是由实数组成并且非常紧密地聚集,则可以每次重新计算findMin
和findMax
以保证每次删除一些元素。性能将为 O(n log n),基本上不比当前的解决方案差,但只有相当弱的条件,即集合没有完全聚集在某个累积点周围,平均性能应该为 ~((logn) ^2),这是相当恒定的。如果它是一组整数,则得到 O(log n * log m),其中 m 是该集合的初始范围;只有实数可能会在区间二分(或其他具有累积点的顺序类型的数据类型)中导致非常糟糕的性能。
附言。这会产生完全均匀的分布,只要注意是否有相互偏离的元素以确保可以获取顶部和底部的元素。
编辑:添加了“代码”
一些不优雅的、未经检查的(伪?)代码。我当前的机器上没有编译器来进行冒烟测试,可能会出现错误,并且可能可以用更少的
if
来完成。一件事:检查mid
是如何生成的;它需要一些调整,具体取决于您是否正在寻找适用于整数集或实数集的东西(区间二分本质上是拓扑的,并且对于具有不同拓扑的集合不应该完全相同)。Here's an idea: You could do interval bisection.
size s
is constant time. UserandomR
to get how far into the set you are selecting.split
with various values between the originalfindMin
andfindMax
until you get the element at the position you want. If you really fear that the set is made up say of reals and is extremely tightly clustered, you can recomputefindMin
andfindMax
each time to guarantee knocking off some elements each time.The performance would be O(n log n), basically no worse than your current solution, but with only rather weak conditions to the effect that the set not be entirely clustered round some accumulation point, the average performance should be ~((logn)^2), which is fairly constant. If it's a set of integers, you get O(log n * log m), where m is the initial range of the set; it's only reals that could cause really nasty performance in an interval bisection (or other data types whose order-type has accumulation points).
PS. This produces a perfectly even distribution, as long as watching for off-by-ones to make sure it's possible to get the elements at the top and bottom.
Edit: added 'code'
Some inelegant, unchecked (pseudo?) code. No compiler on my current machine to smoke test, possibility of off-by-ones, and could probably be done with fewer
if
s. One thing: check out howmid
is generated; it'll need some tweaking depending on whether you are looking for something that works with sets of ints or reals (interval bisection is inherently topological, and oughtn't to work quite the same for sets with different topologies).从 containers-0.5.2.0 开始,
Data.Set
模块有一个elemAt
函数,它通过元素排序序列中从零开始的索引检索值。所以现在编写这个函数很简单,因为
Set.elemAt
和Set.deleteAt
都是 O(log n),其中 n 是集合中元素的数量,整个操作是O(log n)As of containers-0.5.2.0 the
Data.Set
module has anelemAt
function, which retrieves values by their zero-based index in the sorted sequence of elements. So it is now trivial to write this functionSince both
Set.elemAt
andSet.deleteAt
are O(log n) where n is the number of elements in the set, the entire operation is O(log n)如果您有权访问 Data.Set 的内部结构只是一棵二叉树,您可以在树上递归,在每个节点根据各自的大小概率选择一个分支。这非常简单,并且在内存管理和分配方面为您提供了非常好的性能,因为您无需进行额外的簿记工作。 OTOH,您必须调用 RNG O(log n) 次。
一种变体是使用 Jonas 的建议,首先获取大小并根据该大小选择随机元素的索引,然后使用 Data.Set 中的(尚未添加 elemAt)函数。
If you had access to the internals of Data.Set, which is just a binary tree, you could recurse over the tree, at each node selecting one of the branches with probability according to their respective sizes. This is quite straight forward and gives you very good performance in terms of memory management and allocations, as you have no extra book-keeping to do. OTOH, you have to invoke the RNG O(log n) times.
A variant is using Jonas’ suggestion to first take the size and select the index of the random element based on that, and then use a (yet to be added elemAt) function to Data.Set.
如果您不需要修改集合或不需要经常修改它,则可以使用数组作为查找表,访问时间为 O(1)。
If you don't need to modify your set or need to modify it infrequently you can use arrays as lookup table with O(1) access time.
如果您不介意完全消耗您的
RandomGen
,这个问题可以稍微解决一下。对于可拆分的生成器,这是一件很不错的事情。基本思想是为集合创建一个查找表:这将具有非常好的性能:它将花费 O(n+m) 时间,其中 n 是集合的大小,m 是结果的元素数量列出您评估的内容。 (当然,还要加上在范围内随机选择 m 个数字所需的时间。)
This problem can be finessed a bit if you don't mind completely consuming your
RandomGen
. With splittable generators, this is an A-OK thing to do. The basic idea is to make a lookup table for the set:This will have very good performance: it will cost you O(n+m) time, where n is the size of the set and m is the number of elements of the resulting list you evaluate. (Plus the time it takes to randomly choose m numbers in bounds, of course.)
实现此目的的另一种方法可能是使用 Data.Sequence 而不是 Data.Set。这将允许您在 O(1) 时间内将元素添加到末尾,并在 O(log n) 时间内索引元素。如果您还需要能够进行成员资格测试或删除,则必须使用更通用的 Fingertree 包并使用诸如
FingerTree (Sum 1, Max a) a
之类的东西。要插入元素,请使用 Max a 注释来找到正确的插入位置;这基本上需要 O(log n) 时间(对于某些使用模式可能会少一些)。要进行成员资格测试,请执行基本相同的操作,因此时间为 O(log n)(同样,对于某些使用模式,这可能会少一些)。要选择随机元素,请使用Sum 1
注释进行索引,需要 O(log n) 时间(这将是均匀随机索引的平均情况)。Another way to achieve this might be to use Data.Sequence instead of Data.Set. This would allow you to add elements to the end in O(1) time and index elements in O(log n) time. If you also need to be able to do membership tests or deletions, you would have to use the more general fingertree package and use something like
FingerTree (Sum 1, Max a) a
. To insert an element, use theMax a
annotation to find the right place to insert; this basically takes O(log n) time (for some usage patterns it might be a bit less). To do a membership test, do basically the same thing, so it's O(log n) time (again, for some usage patterns this might be a bit less). To pick a random element, use theSum 1
annotation to do your indexing, taking O(log n) time (this will be the average case for uniformly random indices).我想也许很多答案都是旧的,因为现在 set 模块有自己的 elemAt 函数可以很容易地完成这类事情。这是我的实现:
Prelude 函数
null
和length
在集合上工作,因为Set
是Foldable
的实例。I think maybe a lot of the answers are old here because now the set module has its own
elemAt
function to do this sort of thing very easily. Here's my implementation:Prelude functions
null
andlength
work on sets asSet
is an instance ofFoldable
.