与 6 位随机字母数字代码发生冲突的概率是多少？

发布于 2024-12-07 06:10:23 字数 281 浏览 8 评论 0原文

我使用以下 Perl 代码生成随机字母数字字符串（仅限大写字母和数字），用作 MySQL 数据库中记录的唯一标识符。数据库的行数可能会保持在 1,000,000 行以下，但实际的绝对最大值约为 3,000,000 行。我是否有 2 条记录具有相同随机代码的危险机会，或者这种情况发生的次数可能微不足道？我对概率知之甚少（如果从这个问题的本质来看还不是很清楚的话）并且希望有人提供意见。

perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'

原文

I'm using the following perl code to generate random alphanumeric strings (uppercase letters and numbers, only) to use as unique identifiers for records in my MySQL database. The database is likely to stay under 1,000,000 rows, but the absolute realistic maximum would be around 3,000,000. Do I have a dangerous chance of 2 records having the same random code, or is it likely to happen an insignificantly small number of times? I know very little about probability (if that isn't already abundantly clear from the nature of this question) and would love someone's input.

perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

牵强ㄟ 2024-12-14 06:10:24

由于生日悖论，它的可能性比你想象的要大。

有 2,176,782,336 个可能的代码，但即使只插入 50,000 行，发生冲突的可能性也已经相当高了。对于 1,000,000 行，几乎不可避免地会出现许多冲突（我认为平均约为 250 次）。

我运行了一些测试，这是在第一次冲突发生之前我可以生成的代码数量：

73366
59307
79297
36909

随着代码数量的增加，冲突将变得更加频繁。

这是我的测试代码（用 Python 编写）：

>>> import random
>>> codes = set()
>>> while 1:
        code=''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm')for x in range(6))
        if code in codes: break
        codes.add(code)

>>> len(codes)
36909

Because of the Birthday Paradox it's more likely than you might think.

There are 2,176,782,336 possible codes, but even inserting just 50,000 rows there is already a quite high chance of a collision. For 1,000,000 rows it is almost inevitable that there will be many collisions (I think about 250 on average).

I ran a few tests and this is the number of codes I could generate before the first collision occurred:

73366
59307
79297
36909

Collisions will become more frequent as the number of codes increases.

Here was my test code (written in Python):

>>> import random
>>> codes = set()
>>> while 1:
        code=''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm')for x in range(6))
        if code in codes: break
        codes.add(code)

>>> len(codes)
36909

回复收藏 0 原文

孤独岁月 2024-12-14 06:10:24

嗯，你有 36**6 个可能的代码，大约是 20 亿。称之为d。使用此处找到的公式，我们发现对于n< /em> 代码，大约为

1 - ((d-1)/d)**(n*(n-1)/2)

对于任何超过 50,000 左右的 n，这相当高。

看起来10个字符的代码的冲突概率只有1/800左右。所以选择 10 个或更多。

回复收藏 0 原文

梦在深巷 2024-12-14 06:10:24

基于 http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people，仅插入后就有50%的几率遇到至少一次碰撞大约 55,000 条记录放入此大小的 Universe 中：

http://wolfr.am/niaHIF

尝试插入两条六倍的记录几乎肯定会导致碰撞。您需要非随机分配代码，或使用更大的代码。

回复收藏 0 原文

雨后咖啡店 2024-12-14 06:10:24

如前所述，生日悖论使得这一事件很有可能发生。特别是，当问题被视为碰撞问题时，可以确定精确的近似。令 p(n; d) 为至少两个数字相同的概率，d 为组合数，n 为数字的小径。然后，我们可以证明 p(n; d) 大约等于：

1 - ((d-1)/d)^(n*(n-1)/2)

我们可以轻松地在 R 中绘制它：

> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')

给出

在此处输入图像描述

如您所见，碰撞概率随着试验/行数的增加而快速增加

As mentioned previously, the birthday paradox makes this event quite likely. In particular, a accurate approximation can be determined when the problem is cast as a collision problem. Let p(n; d) be the probability that at least two numbers are the same, d be the number of combinations and n the number of trails. Then, we can show that p(n; d) is approximately equal to:

1 - ((d-1)/d)^(n*(n-1)/2)

We can easily plot this in R:

> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')

which gives

enter image description here

As you can see the collision probability increases very quickly with the number of trials/rows

回复收藏 0 原文

我不是你的备胎 2024-12-14 06:10:24

虽然我不知道您想要如何使用这些伪随机 ID 的具体细节，但您可能需要考虑生成一个包含 3000000 个整数（从 1 到 3000000）的数组并随机打乱它。这将保证数字是唯一的。
请参阅维基百科上的Fisher-Yates shuffle。

回复收藏 0 原文

清君侧 2024-12-14 06:10:24

警告：请注意不要依赖内置的 rand，其中伪随机数生成器的质量很重要。我最近发现了 Math::Random::MT：：自动：

Mersenne Twister 是一种快速伪随机数生成器 (PRNG)，能够向可能耗尽可用“真正”随机数据源或系统的应用程序提供大量（> 10^6004）“高质量”伪随机数据-提供 PRNG，例如 rand。

该模块提供了 rand 的替代品，非常方便。

您可以使用以下代码生成密钥序列：

#!/usr/bin/env perl

use warnings; use strict;
use Math::Random::MT::Auto qw( rand );

my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;

for my $i (1 .. $SEQUENCE_LENGTH) {
    my $pick = pick_one();
    $picks += 1;

    redo if exists $dict{ $pick };

    $dict{ $pick } = undef;
}

printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;

sub pick_one {
    join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}

前段时间，我写了关于 Windows 上内置 rand 的有限范围。您可能不是使用 Windows，但您的系统可能存在其他限制或陷阱。

A caution: Beware of relying on the built-in rand where the quality of the pseudo random number generator matters. I recently found out about Math::Random::MT::Auto:

The Mersenne Twister is a fast pseudorandom number generator (PRNG) that is capable of providing large volumes (> 10^6004) of "high quality" pseudorandom data to applications that may exhaust available "truly" random data sources or system-provided PRNGs such as rand.

The module provides a drop in replacement for rand which is handy.

You can generate the sequence of keys with the following code:

#!/usr/bin/env perl

use warnings; use strict;
use Math::Random::MT::Auto qw( rand );

my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;

for my $i (1 .. $SEQUENCE_LENGTH) {
    my $pick = pick_one();
    $picks += 1;

    redo if exists $dict{ $pick };

    $dict{ $pick } = undef;
}

printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;

sub pick_one {
    join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}

Some time ago, I wrote about the limited range of built-in rand on Windows. You may not be on Windows, but there might be other limitations or pitfalls on your system.

回复收藏 0 原文

~没有更多了~