What is the O() value of naive random selection from a finite set?
This question on getting random values from a finite set got me thinking...
It's fairly common for people to want to retrieve X unique values from a set of Y values. For example, I may want to deal a hand from a deck of cards. I want 5 cards, and I want them to all be unique.
Now, I can do this naively, by picking a random card 5 times, and trying again each time I get a duplicate, until I get 5 cards. This isn't so great, however, for large numbers of values from large sets. If I wanted 999,999 values from a set of 1,000,000, for instance, this method gets very bad.
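To pin down what the naive method looks like, here is a minimal sketch (Python; the names are mine, purely illustrative):

```python
import random

def deal_hand(deck_size=52, hand_size=5):
    """Naive method: draw at random, retrying duplicates, until the hand is full."""
    hand = set()
    while len(hand) < hand_size:
        hand.add(random.randrange(deck_size))  # a duplicate simply re-loops
    return hand
```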
The question is: how bad? I'm looking for someone to explain an O() value. Getting the xth number will take y attempts...but how many? I know how to figure this out for any given value, but is there a straightforward way to generalize this for the whole series and get an O() value?
(The question is not: "how can I improve this?" because it's relatively easy to fix, and I'm sure it's been covered many times elsewhere.)
8 Answers
Variables
n = the total number of items in the set
m = the number of unique values to be retrieved from the set of n items
d(i) = the expected number of tries needed to obtain a new value in step i
i = denotes one specific step. i ∈ [0, n-1]
T(m,n) = the expected total number of tries for selecting m unique items from a set of n items using the naive algorithm
Reasoning
The first step, i=0, is trivial. No matter which value we choose, we get a unique one on the first attempt. Hence:

d(0) = 1
In the second step, i=1, we need at least 1 try (the try where we pick a valid unique value). On top of this, there is a chance that we choose the wrong value. This chance is (number of previously picked items)/(total number of items), in this case 1/n. In the case where we picked the wrong item, there is a 1/n chance we may pick the wrong item again. Multiplying this by 1/n, since that is the combined probability that we pick wrong both times, gives (1/n)^2. To understand this, it is helpful to draw a decision tree. Having picked a non-unique item twice, there is a probability that we will do it again. This results in the addition of (1/n)^3 to the total expected number of tries in step i=1. Each time we pick the wrong number, there is a chance we might pick the wrong number again. This results in:

d(1) = 1 + 1/n + (1/n)^2 + (1/n)^3 + ...
Similarly, in the general ith step, the chance to pick the wrong item in one choice is i/n, resulting in:

d(i) = 1 + i/n + (i/n)^2 + (i/n)^3 + ...
This is a geometric sequence and hence it is easy to compute its sum:

d(i) = 1 / (1 - i/n)
The overall complexity is then computed by summing the expected number of tries in each step:

T(m,n) = d(0) + d(1) + ... + d(m-1) = 1/(1 - 0/n) + 1/(1 - 1/n) + ... + 1/(1 - (m-1)/n)
Extending the fractions in the series above by n, we get:

T(m,n) = n/n + n/(n-1) + n/(n-2) + ... + n/(n-m+1)
We can use the fact that:

n/(n-i) ≤ n/(n-m+1) for all i in [0, m-1]
Since the series has m terms, and each term satisfies the inequality above, we get:

T(m,n) ≤ m * n/(n-m+1)
It might be (and probably is) possible to establish a slightly stricter upper bound by using some technique to evaluate the series, instead of bounding it by the rough method of (number of terms) * (biggest term).
Conclusion
This would mean that the Big-O order is O(m*n/(n-m+1)). I see no way to simplify this expression further.
Looking back at the result to check that it makes sense, we see that if n is constant and m gets closer and closer to n, the result quickly increases, since the denominator gets very small. This is what we'd expect if we consider, for example, the case given in the question about selecting "999,999 values from a set of 1,000,000". If we instead let m be constant and let n grow really, really large, the complexity converges towards O(m) in the limit n → ∞. This is also what we'd expect, since when choosing a constant number of items from a "close to" infinitely sized set, the probability of choosing a previously chosen value is basically 0. That is, we need m tries independent of n, since there are no collisions.
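As a quick sanity check on T(m,n), here is a small Monte Carlo sketch of my own (not part of the original answer) comparing simulated tries against the exact series and the bound:

```python
import random

def naive_tries(n, m):
    """Count every draw made by the naive retry algorithm."""
    seen, tries = set(), 0
    while len(seen) < m:
        tries += 1
        seen.add(random.randrange(n))
    return tries

n, m, runs = 1000, 900, 200
simulated = sum(naive_tries(n, m) for _ in range(runs)) / runs
exact = sum(n / (n - i) for i in range(m))   # the series T(m,n) above
bound = m * n / (n - m + 1)                  # the upper bound above
print(simulated, exact, bound)               # simulated ≈ exact ≤ bound
```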
If you have already chosen i values, then the probability that you pick a new one from a set of y values is

(y - i)/y
Hence the expected number of trials to get the (i+1)-th element is

y/(y - i)
Thus the expected number of trials to choose x unique elements is the sum

y/y + y/(y-1) + y/(y-2) + ... + y/(y-x+1)
This can be expressed using harmonic numbers as

y * (H_y - H_(y-x))
From the Wikipedia page you get the approximation

H_x ≈ ln(x) + γ   (γ being the Euler–Mascheroni constant)
Hence the number of necessary trials to pick x unique elements from a set of y elements is

y * (ln(y) - ln(y - x)) = y * ln( y / (y - x) )
If you need to, you can get a more precise approximation by using a more precise approximation for H_x. In particular, when x is small it is possible to improve the result a lot.
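To illustrate how close the approximation gets, here is a numeric check of mine (not from the original answer), using the 1,000,000-element example from the question:

```python
import math

def exact_trials(y, x):
    # y/y + y/(y-1) + ... + y/(y-x+1)
    return sum(y / (y - i) for i in range(x))

y, x = 1_000_000, 999_999
print(exact_trials(y, x))          # ~1.34e7 trials (exact expectation)
print(y * math.log(y / (y - x)))   # ~1.38e7 (harmonic-number approximation)
```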
If you're willing to make the assumption that your random number generator will always find a unique value before cycling back to a previously seen value for a given draw, this algorithm is O(m^2), where m is the number of unique values you are drawing.
So, if you are drawing m values from a set of n values, the 1st value will require you to draw at most 1 to get a unique value. The 2nd requires at most 2 (you see the 1st value, then a unique value), the 3rd 3, ... the mth m. Hence in total you require 1 + 2 + 3 + ... + m = [m*(m+1)]/2 = (m^2 + m)/2 draws. This is O(m^2).
Without this assumption, I'm not sure how you can even guarantee the algorithm will complete. It's quite possible (especially with a pseudo-random number generator which may have a cycle), that you will keep seeing the same values over and over and never get to another unique value.
==EDIT==
For the average case:
On your first draw, you will make exactly 1 draw.
On your 2nd draw, you expect to make 1 (the successful draw) + 1/n (the "partial" draw which represents your chance of drawing a repeat)
On your 3rd draw, you expect to make 1 (the successful draw) + 2/n (the "partial" draw...)
...
On your mth draw, you expect to make 1 + (m-1)/n draws.
Thus, you will make 1 + (1 + 1/n) + (1 + 2/n) + ... + (1 + (m-1)/n) draws altogether in the average case.
This equals the sum from i=0 to (m-1) of [1 + i/n]. Let's denote that sum(1 + i/n, i, 0, m-1).
Then:

sum(1 + i/n, i, 0, m-1) = m + (1/n) * sum(i, i, 0, m-1) = m + (m-1)*m/(2n) = m + (m^2 - m)/(2n)
We drop the low-order terms and the constants, and we get that the expected number of repeated draws is O(m^2/n), so the total is O(m + m^2/n) draws, where m is the number to be drawn and n is the size of the list.
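A quick check of that closed form with exact arithmetic (an illustration of mine; the 52/5 values echo the card example from the question):

```python
from fractions import Fraction

def expected_draws(n, m):
    # sum(1 + i/n, i, 0, m-1)
    return sum(1 + Fraction(i, n) for i in range(m))

n, m = 52, 5
print(expected_draws(n, m))            # 135/26 ≈ 5.19 draws
print(m + Fraction(m*m - m, 2*n))      # closed form m + (m^2 - m)/(2n): 135/26
```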
There's a beautiful O(n) algorithm for this. It goes as follows. Say you have n items, from which you want to pick m items. I assume the function rand() yields a random real number between 0 and 1. Here's the algorithm:
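A sketch of the classic selection-sampling loop, which matches this description (my reconstruction, not necessarily the author's exact code):

```python
import random

def select_sample(n, m):
    """Pick m of the items 0..n-1 in one O(n) pass, using O(m) memory."""
    selected = []
    for i in range(n):
        # Keep item i with probability (still needed) / (items still unseen).
        if random.random() < (m - len(selected)) / (n - i):
            selected.append(i)
    return selected
```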
It can be proved that this algorithm does indeed pick each subset of m items with equal probability, though the proof is non-obvious. Unfortunately, I don't have a reference handy at the moment.
Edit: The advantage of this algorithm is that it takes only O(m) memory (assuming the items are simply integers or can be generated on the fly), compared to doing a shuffle, which takes O(n) memory.
Your actual question is actually a lot more interesting than what I answered (and harder). I've never been any good at statistics (and it's been a while since I did any), but intuitively, I'd say that the run-time complexity of that algorithm would probably be something like exponential. As long as the number of elements picked is small enough compared to the size of the array, the collision rate will be so small that it will be close to linear time, but at some point the number of collisions will probably grow fast and the run-time will go down the drain.
If you want to prove this, I think you'd have to do something moderately clever with the expected number of collisions as a function of the wanted number of elements. It might also be possible to do it by induction, but I think going by that route would require more cleverness than the first alternative.
EDIT: After giving it some thought, here's my attempt:
Given an array of m elements, and looking for n random and different elements: it is then easy to see that when we want to pick the ith element, the odds of picking an element we've already visited are (i-1)/m. This is then the expected number of collisions for that particular pick. For picking n elements, the expected number of collisions will be the sum of the expected collisions for each pick. We plug this into Wolfram Alpha (sum (i-1)/m, i=1 to n) and we get the answer (n**2 - n)/2m. The average number of picks for our naive algorithm is then n + (n**2 - n)/2m. Unless my memory fails me completely (which is entirely possible, actually), this gives an average-case run-time of O(n**2).
The worst case for this algorithm is clearly when you're choosing the full set of N items. This is equivalent to asking: On average, how many times must I roll an N-sided die before each side has come up at least once?
Answer: N * H_N, where H_N is the Nth harmonic number, a value famously approximated by log(N). This means the algorithm in question is N log N. As a fun example, if you roll an ordinary 6-sided die until you see one of each number, it will take on average 6 * H_6 = 14.7 rolls.
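A tiny simulation of the die example (my own illustration, not from the answer):

```python
import random

def rolls_until_all_sides(sides=6):
    """Roll until every face has appeared at least once; return the roll count."""
    seen, rolls = set(), 0
    while len(seen) < sides:
        seen.add(random.randint(1, sides))
        rolls += 1
    return rolls

trials = 100_000
print(sum(rolls_until_all_sides() for _ in range(trials)) / trials)  # ≈ 14.7
```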
Before answering this question in detail, let's define the framework. Suppose you have a collection {a1, a2, ..., an} of n distinct objects, and want to pick m distinct objects from this set, such that the probability of a given object aj appearing in the result is equal for all objects.
If you have already picked k items, and randomly pick an item from the full set {a1, a2, ..., an}, the probability that the item has not been picked before is (n-k)/n. This means that the number of samples you have to take before you get a new object is (assuming independence of random sampling) geometric with parameter (n-k)/n. Thus the expected number of samples to obtain one extra item is n/(n-k), which is close to 1 if k is small compared to n.
Concluding, if you need m unique objects, randomly selected, this algorithm gives you
n/n + n/(n-1) + n/(n-2) + n/(n-3) + .... + n/(n-(m-1))
which, as Alderath showed, can be estimated by
m*n / (n-m+1).
You can see a little bit more from this formula:
* The expected number of samples to obtain a new unique element increases as the number of already chosen objects increases (which sounds logical).
* You can expect really long computation times when m is close to n, especially if n is large.
In order to obtain m unique members from the set, use a variant of Donald Knuth's algorithm for obtaining a random permutation. Here, I'll assume that the n objects are stored in an array.
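A minimal sketch of that partial-shuffle loop (Python; my reconstruction, with randInt and exchange written out as the helpers the next paragraph refers to):

```python
import random

def randInt(i, n):
    # Sample an integer uniformly from {i, i+1, ..., n}.
    return random.randint(i, n)

def exchange(a, i, j):
    # Flip two members of the array.
    a[i], a[j] = a[j], a[i]

def pick_unique(a, m):
    """Partial Fisher-Yates shuffle: the first m slots end up as the sample."""
    n = len(a)
    for i in range(m):                    # 0-based indexing, unlike the prose
        exchange(a, i, randInt(i, n - 1))
    return a[:m]
```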
Here, randInt samples an integer from {i, i+1, ..., n}, and exchange flips two members of the array. You only need to shuffle m times, so the computation time is O(m), whereas the memory is O(n) (although you can adapt it to only save the entries such that a[i] <> i, which would give you O(m) in both time and memory, but with higher constants).
Most people forget that looking up whether a number has already been drawn also takes a while.
The number of tries necessary can, as described earlier, be evaluated from:

T(n,m) = n/n + n/(n-1) + ... + n/(n-m+1) = n * (H_n - H_(n-m))
which goes to n*ln(n) for interesting values of m. However, for each of these 'tries' you will have to do a lookup. This might be a simple O(n) runthrough, or something like a binary tree. This will give you a total performance of n^2*ln(n) or n*ln(n)^2.
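To make the lookup cost concrete, here is a sketch of mine (not from the answer) contrasting a linear seen-list lookup with an average-O(1) hash-set lookup:

```python
import random

def naive_with_list(n, m):
    """Each membership test is a linear runthrough of the picked values."""
    picked = []
    while len(picked) < m:
        v = random.randrange(n)
        if v not in picked:        # O(len(picked)) lookup: the hidden cost
            picked.append(v)
    return picked

def naive_with_set(n, m):
    """A hash set makes each lookup O(1) on average."""
    picked = set()
    while len(picked) < m:
        picked.add(random.randrange(n))  # membership handled by the set itself
    return list(picked)
```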
For smaller values of m (m < n/2), you can do a very good approximation for T(n,m) using the HA inequality, yielding the formula:

T(n,m) ≈ n * ln( n/(n-m) )

As m goes to n, this gives a lower bound of O(n) tries and performance O(n^2) or O(n*ln(n)).
All the results are, however, far better than I would ever have expected, which shows that the algorithm might actually be just fine in many non-critical cases, where you can accept occasional longer running times (when you are unlucky).