与 6 位随机字母数字代码发生冲突的概率是多少?
我使用以下 Perl 代码生成随机字母数字字符串(仅限大写字母和数字),用作 MySQL 数据库中记录的唯一标识符。数据库的行数可能会保持在 1,000,000 行以下,但实际的绝对最大值约为 3,000,000 行。我是否有 2 条记录具有相同随机代码的危险机会,或者这种情况发生的次数可能微不足道?我对概率知之甚少(如果从这个问题的本质来看还不是很清楚的话)并且希望有人提供意见。
perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'
I'm using the following perl code to generate random alphanumeric strings (uppercase letters and numbers, only) to use as unique identifiers for records in my MySQL database. The database is likely to stay under 1,000,000 rows, but the absolute realistic maximum would be around 3,000,000. Do I have a dangerous chance of 2 records having the same random code, or is it likely to happen an insignificantly small number of times? I know very little about probability (if that isn't already abundantly clear from the nature of this question) and would love someone's input.
perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
由于生日悖论,它的可能性比你想象的要大。
有 2,176,782,336 个可能的代码,但即使只插入 50,000 行,发生冲突的可能性也已经相当高了。对于 1,000,000 行,几乎不可避免地会出现许多冲突(我认为平均约为 250 次)。
我运行了一些测试,这是在第一次冲突发生之前我可以生成的代码数量:
随着代码数量的增加,冲突将变得更加频繁。
这是我的测试代码(用 Python 编写):
Because of the Birthday Paradox it's more likely than you might think.
There are 2,176,782,336 possible codes, but even inserting just 50,000 rows there is already a quite high chance of a collision. For 1,000,000 rows it is almost inevitable that there will be many collisions (I think about 250 on average).
I ran a few tests and this is the number of codes I could generate before the first collision occurred:
Collisions will become more frequent as the number of codes increases.
Here was my test code (written in Python):
嗯,你有 36**6 个可能的代码,大约是 20 亿。称之为d。使用此处找到的公式,我们发现对于n< /em> 代码,大约为
1 - ((d-1)/d)**(n*(n-1)/2)
对于任何超过 50,000 左右的 n,这相当高。
看起来10个字符的代码的冲突概率只有1/800左右。所以选择 10 个或更多。
Well, you have 36**6 possible codes, which is about 2 billion. Call this d. Using a formula found here, we find that the probability of a collision, for n codes, is approximately
1 - ((d-1)/d)**(n*(n-1)/2)
For any n over 50,000 or so, that's pretty high.
Looks like a 10-character code has a collision probability of only about 1/800. So go with 10 or more.
基于 http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people,仅插入后就有50%的几率遇到至少一次碰撞大约 55,000 条记录放入此大小的 Universe 中:
http://wolfr.am/niaHIF
尝试插入两条六倍的记录几乎肯定会导致碰撞。您需要非随机分配代码,或使用更大的代码。
Based on the equations given at http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people, there is a 50% chance of encountering at least one collision after inserting only 55,000 records or so into a universe of this size:
http://wolfr.am/niaHIF
Trying to insert two to six times as many records will almost certainly lead to a collision. You'll need to assign codes nonrandomly, or use a larger code.
如前所述,生日悖论使得这一事件很有可能发生。特别是,当问题被视为碰撞问题时,可以确定精确的近似。令
p(n; d)
为至少两个数字相同的概率,d
为组合数,n
为数字的小径。然后,我们可以证明p(n; d)
大约等于:我们可以轻松地在 R 中绘制它:
给出
如您所见,碰撞概率随着试验/行数的增加而快速增加
As mentioned previously, the birthday paradox makes this event quite likely. In particular, a accurate approximation can be determined when the problem is cast as a collision problem. Let
p(n; d)
be the probability that at least two numbers are the same,d
be the number of combinations andn
the number of trails. Then, we can show thatp(n; d)
is approximately equal to:We can easily plot this in R:
which gives
As you can see the collision probability increases very quickly with the number of trials/rows
虽然我不知道您想要如何使用这些伪随机 ID 的具体细节,但您可能需要考虑生成一个包含 3000000 个整数(从 1 到 3000000)的数组并随机打乱它。这将保证数字是唯一的。
请参阅维基百科上的Fisher-Yates shuffle。
While I don't know the specifics of exactly how you want to use these pseudo-random IDs, you may want to consider generating an array of 3000000 integers (from 1 to 3000000) and randomly shuffling it. That would guarantee that the numbers are unique.
See Fisher-Yates shuffle on Wikipedia.
警告:请注意不要依赖内置的
rand
,其中伪随机数生成器的质量很重要。我最近发现了 Math::Random::MT: :自动:该模块提供了
rand
的替代品,非常方便。您可以使用以下代码生成密钥序列:
前段时间,我写了关于 Windows 上内置
rand
的有限范围。您可能不是使用 Windows,但您的系统可能存在其他限制或陷阱。A caution: Beware of relying on the built-in
rand
where the quality of the pseudo random number generator matters. I recently found out about Math::Random::MT::Auto:The module provides a drop in replacement for
rand
which is handy.You can generate the sequence of keys with the following code:
Some time ago, I wrote about the limited range of built-in
rand
on Windows. You may not be on Windows, but there might be other limitations or pitfalls on your system.