与 6 位随机字母数字代码发生冲突的概率是多少?

发布于 2024-12-07 06:10:23 字数 281 浏览 0 评论 0原文

我使用以下 Perl 代码生成随机字母数字字符串(仅限大写字母和数字),用作 MySQL 数据库中记录的唯一标识符。数据库的行数可能会保持在 1,000,000 行以下,但实际的绝对最大值约为 3,000,000 行。我是否有 2 条记录具有相同随机代码的危险机会,或者这种情况发生的次数可能微不足道?我对概率知之甚少(如果从这个问题的本质来看还不是很清楚的话)并且希望有人提供意见。

perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'

I'm using the following perl code to generate random alphanumeric strings (uppercase letters and numbers, only) to use as unique identifiers for records in my MySQL database. The database is likely to stay under 1,000,000 rows, but the absolute realistic maximum would be around 3,000,000. Do I have a dangerous chance of 2 records having the same random code, or is it likely to happen an insignificantly small number of times? I know very little about probability (if that isn't already abundantly clear from the nature of this question) and would love someone's input.

perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

牵强ㄟ 2024-12-14 06:10:24

由于生日悖论,它的可能性比你想象的要大。

有 2,176,782,336 个可能的代码,但即使只插入 50,000 行,发生冲突的可能性也已经相当高了。对于 1,000,000 行,几乎不可避免地会出现许多冲突(我认为平均约为 250 次)。

我运行了一些测试,这是在第一次冲突发生之前我可以生成的代码数量:

  • 73366
  • 59307
  • 79297
  • 36909

随着代码数量的增加,冲突将变得更加频繁。

这是我的测试代码(用 Python 编写):

>>> import random
>>> codes = set()
>>> while 1:
        code=''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm')for x in range(6))
        if code in codes: break
        codes.add(code)

>>> len(codes)
36909

Because of the Birthday Paradox it's more likely than you might think.

There are 2,176,782,336 possible codes, but even inserting just 50,000 rows there is already a quite high chance of a collision. For 1,000,000 rows it is almost inevitable that there will be many collisions (I think about 250 on average).

I ran a few tests and this is the number of codes I could generate before the first collision occurred:

  • 73366
  • 59307
  • 79297
  • 36909

Collisions will become more frequent as the number of codes increases.

Here was my test code (written in Python):

>>> import random
>>> codes = set()
>>> while 1:
        code=''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm')for x in range(6))
        if code in codes: break
        codes.add(code)

>>> len(codes)
36909
孤独岁月 2024-12-14 06:10:24

嗯,你有 36**6 个可能的代码,大约是 20 亿。称之为d。使用此处找到的公式,我们发现对于n< /em> 代码,大约为

1 - ((d-1)/d)**(n*(n-1)/2)

对于任何超过 50,000 左右的 n,这相当高。

看起来10个字符的代码的冲突概率只有1/800左右。所以选择 10 个或更多。

Well, you have 36**6 possible codes, which is about 2 billion. Call this d. Using a formula found here, we find that the probability of a collision, for n codes, is approximately

1 - ((d-1)/d)**(n*(n-1)/2)

For any n over 50,000 or so, that's pretty high.

Looks like a 10-character code has a collision probability of only about 1/800. So go with 10 or more.

梦在深巷 2024-12-14 06:10:24

Based on the equations given at http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people, there is a 50% chance of encountering at least one collision after inserting only 55,000 records or so into a universe of this size:

http://wolfr.am/niaHIF

Trying to insert two to six times as many records will almost certainly lead to a collision. You'll need to assign codes nonrandomly, or use a larger code.

雨后咖啡店 2024-12-14 06:10:24

如前所述,生日悖论使得这一事件很有可能发生。特别是,当问题被视为碰撞问题时,可以确定精确的近似。令 p(n; d) 为至少两个数字相同的概率,d 为组合数,n 为数字的小径。然后,我们可以证明 p(n; d) 大约等于:

1 - ((d-1)/d)^(n*(n-1)/2)

我们可以轻松地在 R 中绘制它:

> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')

给出

在此处输入图像描述

如您所见,碰撞概率随着试验/行数的增加而快速增加

As mentioned previously, the birthday paradox makes this event quite likely. In particular, a accurate approximation can be determined when the problem is cast as a collision problem. Let p(n; d) be the probability that at least two numbers are the same, d be the number of combinations and n the number of trails. Then, we can show that p(n; d) is approximately equal to:

1 - ((d-1)/d)^(n*(n-1)/2)

We can easily plot this in R:

> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')

which gives

enter image description here

As you can see the collision probability increases very quickly with the number of trials/rows

我不是你的备胎 2024-12-14 06:10:24

虽然我不知道您想要如何使用这些伪随机 ID 的具体细节,但您可能需要考虑生成一个包含 3000000 个整数(从 1 到 3000000)的数组并随机打乱它。这将保证数字是唯一的。
请参阅维基百科上的Fisher-Yates shuffle

While I don't know the specifics of exactly how you want to use these pseudo-random IDs, you may want to consider generating an array of 3000000 integers (from 1 to 3000000) and randomly shuffling it. That would guarantee that the numbers are unique.
See Fisher-Yates shuffle on Wikipedia.

清君侧 2024-12-14 06:10:24

警告:请注意不要依赖内置的 rand,其中伪随机数生成器的质量很重要。我最近发现了 Math::Random::MT: :自动

Mersenne Twister 是一种快速伪随机数生成器 (PRNG),能够向可能耗尽可用“真正”随机数据源或系统的应用程序提供大量(> 10^6004)“高质量”伪随机数据-提供 PRNG,例如 rand

该模块提供了 rand 的替代品,非常方便。

您可以使用以下代码生成密钥序列:

#!/usr/bin/env perl

use warnings; use strict;
use Math::Random::MT::Auto qw( rand );

my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;

for my $i (1 .. $SEQUENCE_LENGTH) {
    my $pick = pick_one();
    $picks += 1;

    redo if exists $dict{ $pick };

    $dict{ $pick } = undef;
}

printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;

sub pick_one {
    join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}

前段时间,我写了关于 Windows 上内置 rand 的有限范围。您可能不是使用 Windows,但您的系统可能存在其他限制或陷阱。

A caution: Beware of relying on the built-in rand where the quality of the pseudo random number generator matters. I recently found out about Math::Random::MT::Auto:

The Mersenne Twister is a fast pseudorandom number generator (PRNG) that is capable of providing large volumes (> 10^6004) of "high quality" pseudorandom data to applications that may exhaust available "truly" random data sources or system-provided PRNGs such as rand.

The module provides a drop in replacement for rand which is handy.

You can generate the sequence of keys with the following code:

#!/usr/bin/env perl

use warnings; use strict;
use Math::Random::MT::Auto qw( rand );

my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;

for my $i (1 .. $SEQUENCE_LENGTH) {
    my $pick = pick_one();
    $picks += 1;

    redo if exists $dict{ $pick };

    $dict{ $pick } = undef;
}

printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;

sub pick_one {
    join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}

Some time ago, I wrote about the limited range of built-in rand on Windows. You may not be on Windows, but there might be other limitations or pitfalls on your system.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文