当前位置：文江博客话题详情

创建唯一的不可猜测的 36 进制 id

发布于 2024-10-11 10:51:41 字数 430 浏览 4 评论 0原文

对于类似于 URL 缩短服务的应用程序，我想创建不可猜测的 id，我想你们都熟悉这一点。下面是这样一个 id 的示例：

http://example.com/sd23t9

什么是一种好的、有效的技术来生成这些，在将它们作为主键插入数据库表时，冲突风险最小（或没有）？

编辑：
当然，皮斯克沃尔提出了一个很好的观点。我应该提到的是，我的意思是在满足 36^6 限制之前最小的碰撞风险。

编辑2
呃，抛开这一点来说，他的观点所说明的内容当然远不止于此。嗯。那么，也许可以用 id 预生成一个表（就像我已经在其他地方读过的那样）？当我受到 36^6 和非连续约束时，这是否是最有效的技术？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夜光 2024-10-18 10:51:41

Set ID length. // e.g. 6
do {
  Generate a short random ID of the given length
  Collision?
   - No:
      DONE!
   - Yes:
      increase ID length
 } while true

对于任何有限的 ID 长度，总是存在冲突的风险：假设您有 [a-z0-9]{6} ID，只要您有 2,176,782,336 个 ID，就 100% 保证发生冲突（不更多可用键）。由于生日问题，你会更快地遇到冲突。对于如此小的密钥空间，没有办法避免冲突 - 您需要进行某种冲突恢复。

您可以生成一个 ID，直到它不发生冲突 - 但随着密钥空间被填充，这会逐渐变慢：想象一个 [az] 的密钥空间，其中 [an] 和 [pz] 已经被占用 - 现在每个新的随机 ID 都是发生碰撞的可能性大于不发生碰撞的可能性；当你完全填满键空间时，循环根本不会终止。

我提出的算法在这方面可能过于谨慎：如果它发现冲突，它将尝试逐渐更长的 ID（因为它假设“冲突=>检查较短的密钥空间不可行”）。虽然效率不高，但很可能在几次迭代内找到不冲突的 ID。

Set ID length. // e.g. 6
do {
  Generate a short random ID of the given length
  Collision?
   - No:
      DONE!
   - Yes:
      increase ID length
 } while true

For any finite ID length, there's always a risk of collision: assuming from your example that you'd have [a-z0-9]{6} IDs, as soon as you have 2,176,782,336 IDs, a collision is 100% guaranteed (no more available keys). Due to birthday problem, you'll get collisions much, much sooner. With such a small keyspace, there isn't a way to avoid collisions - you'll need to have some sort of collision recovery instead.

You could generate an ID until it doesn't collide - but this will get progressively slower as the keyspace is being filled: imagine a keyspace of [a-z], with [a-n] and [p-z] already taken - now every new random ID is more likely to collide than not; and when you completely fill the keyspace, the loop will not terminate at all.

The algorithm I propose is maybe overcautious in this regard: if it finds a collision, it will try progressively longer IDs (as it assumes "collision => not feasible to check the shorter keyspace"). Although it's not efficient, it's likely to find a non-colliding ID within a few iterations.

回复收藏 0 原文

拥醉 2024-10-18 10:51:41

有点奇怪的想法。为什么不使用排列呢？
例如
你有一组值 [0-9a-z]
当你生成第一个 id 时。您按照字典顺序进行第一个排列。然后是第二个，依此类推。
为了使其看起来不易猜测，您可以更改字典顺序的规则。说“a”在“t”之后或类似的东西。您也可以使用元组而不是完整的排列。
这将确保不会发生碰撞。

这个想法实际上是关于创建某种双向哈希函数。基本上，如果您能够以某种方式对数字“1”进行编码以获得类似“q8d3dw”的内容，并且能够将“q8d3dw”解码回“1”，您可以确定该函数将为您提供所有值的唯一字符串从 1 到 36^6。

问题实际上是选择这个功能。最简单的方法是将“1”与“000000”、“2”与“000001”、“12”与“00000b”相关联。基本上按字典顺序排列所有可用字符串，并选择等于 id 的位置上的字符串。然而，这确实很容易猜到。所以你可以做的就是人为地改变字典顺序的规则。说不是有一个正常的顺序（0,1,2,3...a,b,c...x,y,z），你可以稍微打乱它并得到类似（a,5,t,3 ...）。这会产生更加混乱的结果。然而，它仍然很容易被猜测，因为第一个元素是“aaaaaa”，第二个元素是“aaaaa5”，然后是“aaaaat”。因此，您可以进一步更改字典顺序规则，使它们取决于字符的位置。说出第一个字符 id (a,5,t,3...) 的顺序，第二个字符 id (y,7,3,r...) 的顺序，等等。

现在，我不会发布任何伪代码，因为它将相当长的一篇。我并不建议你走这条路，除非你有兴趣创建这种有趣的算法:)。然而，如果您选择这条路线，它可能是生成此 id 的一种非常有效的方法，而且不会发生冲突。我建议您阅读 Donald Knuth 博士所著的《计算机编程艺术》第 4 卷。对于此类算法的实现有很多建议。

A bit of a weird idea. Why not to use permutations?
e.g.
you have a set of values [0-9a-z]
when you generate first id. you take a first permutation in lexicographical order. then second and so on.
to make it appear less guessable you can change the rules of lexicographic order. say 'a' goes after 't' or something like that. you can use tuple instead of a full permutations as well.
This will ensure there is no collisions.

What this idea is actually about is making some sort of a two way hash function. basically if you are able to encode number "1" in some way to get something like "q8d3dw" and be able to decode "q8d3dw" back to "1" you can be certain that this function will give you unique strings for all the values from 1 to 36^6.

The problem is actually selecting this function. The easy way would be to associate "1" to "000000", "2" to "000001", "12" to "00000b". Basically arrange all available strings in lexicographical order and select the string on a position which is equal to the id. This however is really easy to guess. So what you could do is to artificially change rules of lexicographic order. Say instead of having a normal order(0,1,2,3...a,b,c...x,y,z) you could shuffle it a bit and get something like (a,5,t,3...). This will produce a bit more obfuscated results. However it will still be quite guessable because the first element would be "aaaaaa", second "aaaaa5", then "aaaaat". So you can change rules of lexicographic order even further, making them depended on the position of the character. Say order for the first char id (a,5,t,3...) for the second one (y,7,3,r...), etc.

Now, i will not post any pseudo code as it will be quite a long one. And I'm not advising you to go with that route unless you are interested in creating this sort of algorithms for fun :). However if you will go with this route it could be a very efficient way of generating this id's with no chance of collision. And I'll advise you to read volume 4 of "Art of computer programming" by Dr Donald Knuth. There are lots of suggestions in implementation of such algorithms.

回复收藏 0 原文

夏尔 2024-10-18 10:51:41

@ivan：这是一个实现。

首先，您需要 6 个字符串，这是生成它们的代码：

$letters = range('a', 'z');
$numbers = range(0, 9);
$char_list = array_merge_recursive($letters, $numbers);
for ($i = 0; $i <= 5; $i++) {
    shuffle($char_list);
    echo '$mysterykey[' . $i . "] = '" . implode($char_list) . "';\n";
}

然后您可以在函数中使用该字符串的输出：

function int_to_x36($value) {
    $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97';
    $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef';
    $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk';
    $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc';
    $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w';
    $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh';

    $x36 = array();
    for ($i = 5; $i >= 0; $i--) {
        $x36[$i] = 0;  
        $y = pow(36, $i);

        if ($value >= $y) {
            $z = floor($value/$y);
            $value = $value - ($z * $y);
            $x36[$i] = $z;
        }   
    }      

    $encoded = '';
    for ($i = 0; $i <= 5; $i++) {
        $encoded .= $mysterykey[$i][$x36[$i]];
    }

    return $encoded;
}

function x36_to_int($value) {
    $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97';
    $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef';
    $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk';
    $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc';
    $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w';
    $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh';

    $value36 = str_split($value);

    $x36 = array();
    for ($i = 0; $i <= 5; $i++) {
        $x36[$i] = strpos($mysterykey[$i], $value36[$i]);
    }

    $x = 0;
    for ($i = 5; $i >= 0; $i--) {
        $x += $x36[$i] * pow(36, $i);
    }      

    return $x;
}

我确信我忽略了一些东西，但它似乎正在工作。

@ivan: here is an implementation.

First you need the 6 strings, here's code to generate them:

$letters = range('a', 'z');
$numbers = range(0, 9);
$char_list = array_merge_recursive($letters, $numbers);
for ($i = 0; $i <= 5; $i++) {
    shuffle($char_list);
    echo '$mysterykey[' . $i . "] = '" . implode($char_list) . "';\n";
}

Then you can use the output from that in the function:

function int_to_x36($value) {
    $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97';
    $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef';
    $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk';
    $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc';
    $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w';
    $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh';

    $x36 = array();
    for ($i = 5; $i >= 0; $i--) {
        $x36[$i] = 0;  
        $y = pow(36, $i);

        if ($value >= $y) {
            $z = floor($value/$y);
            $value = $value - ($z * $y);
            $x36[$i] = $z;
        }   
    }      

    $encoded = '';
    for ($i = 0; $i <= 5; $i++) {
        $encoded .= $mysterykey[$i][$x36[$i]];
    }

    return $encoded;
}

function x36_to_int($value) {
    $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97';
    $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef';
    $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk';
    $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc';
    $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w';
    $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh';

    $value36 = str_split($value);

    $x36 = array();
    for ($i = 0; $i <= 5; $i++) {
        $x36[$i] = strpos($mysterykey[$i], $value36[$i]);
    }

    $x = 0;
    for ($i = 5; $i >= 0; $i--) {
        $x += $x36[$i] * pow(36, $i);
    }      

    return $x;
}

I'm sure I've overlooked something but it seems to be working.

回复收藏 0 原文

丶情人眼里出诗心の 2024-10-18 10:51:41

例如大随机数和 SHA-256 哈希？您可以稍后缩短它以满足您的需要。

回复收藏 0 原文

城歌 2024-10-18 10:51:41

如果站点足够大，我的意思是大 - 就像“我们每秒需要分配数百个新ID”（这会产生其他问题，例如在一年内耗尽36^6密钥空间），您可以预先生成密钥并分配它们。

以下是一个伪代码示例 - 在如此大流量的站点中，您可能需要优化所使用的算法。

预生成很简单 - 只需从 000000 开始，一直到 zzzzzz，将每一行插入表中并将其标记为未使用：

 ID     | used
==============
 000000   0   
 000001   0   
 ...
 zzzzzz   0

当您收到新的请求时ID，随机选择一个并将其标记为已使用（危险！并发问题！）：

SELECT ID FROM ids WHERE RAND() LIMIT 1
UPDATE ids SET used = 1, url=what_you_want_shortened WHERE ID = whatever_you_got_from_previous_query

您将遇到效率问题（例如使用 WHERE RAND()、表锁定等），但它是可行的。

If the site is sufficiently large, and I mean large - as in "we need hundreds of new IDs assigned per second" (which will have other problems, e.g. exhaust the 36^6 keyspace under a year), you could pre-generate the keys and assign them.

The following is a pseudocode example - in such a large-traffic site, you'd probably need to optimize the algorithms used.

Pre-generating is trivial - just start at 000000 and go all the way to zzzzzz, insert each row into a table and mark them unused:

 ID     | used
==============
 000000   0   
 000001   0   
 ...
 zzzzzz   0

When you get a request for new ID, select a random one and mark it used (danger! concurrency issues!):

SELECT ID FROM ids WHERE RAND() LIMIT 1
UPDATE ids SET used = 1, url=what_you_want_shortened WHERE ID = whatever_you_got_from_previous_query

You'll run into efficiency issues (e.g. with WHERE RAND(), table locking, et al.), but it is doable.

回复收藏 0 原文

独守阴晴ぅ圆缺 2024-10-18 10:51:41

如果您不反对引入 .NET dll，我已经创建了一个项目来完成此任务。来源是在 GitHub 上，二进制文件位于 IdGenerator NuGet 包。

它为您提供用户指定长度的有序序列和/或随机值，全部以 36 为基数。即使有多个 ID 生成器实例或在分布式环境中，ID 也是普遍唯一的。

var generator = new Base36IdGenerator(
                numTimestampCharacters: 11, 
                numServerCharacters: 4, 
                numRandomCharacters: 5, 
                reservedValue: "", 
                delimiter: "-", 
                delimiterPositions: new[] {15, 10, 5});

// This instance would give you a 20-character Id, with an
// 11-character timestamp, 4-character servername hash, 
// and 5 character random sequence. This gives you 1.6 million
// hash combinations for the server component, and 60 million
// possible combinations for the random sequence. Timestamp is
// microseconds since epoch:
Console.WriteLine(generator.NewId());
// 040VZC6SL01003BZ00R2

// Argument name specified for readability only:
Console.WriteLine(generator.NewId(delimited: true));
// 040VZ-C6SL0-1003B-Z00R2

当然，如果您对不可猜测的字符串比对有序序列更感兴趣，则可以使用 GetRandomString 方法：

// 6-characters give you a little over 2 billion possible 
// combinations (36 ^ 6). 7 characters is about 78 billion.
Console.WriteLine(generator.GetRandomString(length: 6));

其背后的基本原理是：

获取自纪元（64 位数字）以来的微秒数
获取随机数（ 64 位）介于 0 和（36 ^ 所需长度）之间（最多 13）
获取服务器名称哈希（简单 Sha1）
将每个组件转换为 base-36 字符串
用 0 填充左侧到所需长度

Base36 编码器本身来自 http://www.stum.de/2008 /10/20/base36-encoderdecoder-in-c/。

If you're not averse to bringing in a .NET dll, I've created a project to do exactly this. Source is on GitHub here, and binaries are in the IdGenerator NuGet Package.

It gives you ordered sequences and/or random values of user-specified length, all in base-36. Ids are universally unique, even with multiple id generator instances or in a distributed environment.

var generator = new Base36IdGenerator(
                numTimestampCharacters: 11, 
                numServerCharacters: 4, 
                numRandomCharacters: 5, 
                reservedValue: "", 
                delimiter: "-", 
                delimiterPositions: new[] {15, 10, 5});

// This instance would give you a 20-character Id, with an
// 11-character timestamp, 4-character servername hash, 
// and 5 character random sequence. This gives you 1.6 million
// hash combinations for the server component, and 60 million
// possible combinations for the random sequence. Timestamp is
// microseconds since epoch:
Console.WriteLine(generator.NewId());
// 040VZC6SL01003BZ00R2

// Argument name specified for readability only:
Console.WriteLine(generator.NewId(delimited: true));
// 040VZ-C6SL0-1003B-Z00R2

Of course if you're more interested in the string not being guessable than in having an ordered sequence, you could just use the GetRandomString method:

// 6-characters give you a little over 2 billion possible 
// combinations (36 ^ 6). 7 characters is about 78 billion.
Console.WriteLine(generator.GetRandomString(length: 6));

The basic principle behind it is: