创建唯一的不可猜测的 36 进制 id
对于类似于 URL 缩短服务的应用程序,我想创建不可猜测的 id,我想你们都熟悉这一点。下面是这样一个 id 的示例:
什么是一种好的、有效的技术来生成这些,在将它们作为主键插入数据库表时,冲突风险最小(或没有)?
编辑:
当然,皮斯克沃尔提出了一个很好的观点。我应该提到的是,我的意思是在满足 36^6 限制之前最小的碰撞风险。
编辑2
呃,抛开这一点来说,他的观点所说明的内容当然远不止于此。嗯。那么,也许可以用 id 预生成一个表(就像我已经在其他地方读过的那样)?当我受到 36^6 和非连续约束时,这是否是最有效的技术?
For an application similar to URL shortener services, I want to create non guessable id's, which you're all familiar with I presume. Here's an example of such an id:
What is a good and efficient technique to generate these, with minimal (or no) risk of collision when inserting these as a primary key in a database table?
EDIT:
Piskvor makes an excellent point of course. I should have mentioned that I meant minimal collision risk before the 36^6 limit is met.
EDIT 2
Eh, scrap that, his point was illustrating much more than that of course. Hmmm. Pregenerating a table with id's then, perhaps (like I've read elsewhere already)? Would that be the most efficient technique when i'm bound to the 36^6 and none-consecutive constraints perhaps?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
对于任何有限的 ID 长度,总是存在冲突的风险:假设您有 [a-z0-9]{6} ID,只要您有 2,176,782,336 个 ID,就 100% 保证发生冲突(不更多可用键)。由于生日问题,你会更快地遇到冲突。对于如此小的密钥空间,没有办法避免冲突 - 您需要进行某种冲突恢复。
您可以生成一个 ID,直到它不发生冲突 - 但随着密钥空间被填充,这会逐渐变慢:想象一个 [az] 的密钥空间,其中 [an] 和 [pz] 已经被占用 - 现在每个新的随机 ID 都是发生碰撞的可能性大于不发生碰撞的可能性;当你完全填满键空间时,循环根本不会终止。
我提出的算法在这方面可能过于谨慎:如果它发现冲突,它将尝试逐渐更长的 ID(因为它假设“冲突=>检查较短的密钥空间不可行”)。虽然效率不高,但很可能在几次迭代内找到不冲突的 ID。
For any finite ID length, there's always a risk of collision: assuming from your example that you'd have [a-z0-9]{6} IDs, as soon as you have 2,176,782,336 IDs, a collision is 100% guaranteed (no more available keys). Due to birthday problem, you'll get collisions much, much sooner. With such a small keyspace, there isn't a way to avoid collisions - you'll need to have some sort of collision recovery instead.
You could generate an ID until it doesn't collide - but this will get progressively slower as the keyspace is being filled: imagine a keyspace of [a-z], with [a-n] and [p-z] already taken - now every new random ID is more likely to collide than not; and when you completely fill the keyspace, the loop will not terminate at all.
The algorithm I propose is maybe overcautious in this regard: if it finds a collision, it will try progressively longer IDs (as it assumes "collision => not feasible to check the shorter keyspace"). Although it's not efficient, it's likely to find a non-colliding ID within a few iterations.
有点奇怪的想法。为什么不使用排列呢?
例如
你有一组值 [0-9a-z]
当你生成第一个 id 时。您按照字典顺序进行第一个排列。然后是第二个,依此类推。
为了使其看起来不易猜测,您可以更改字典顺序的规则。说“a”在“t”之后或类似的东西。您也可以使用元组而不是完整的排列。
这将确保不会发生碰撞。
这个想法实际上是关于创建某种双向哈希函数。基本上,如果您能够以某种方式对数字“1”进行编码以获得类似“q8d3dw”的内容,并且能够将“q8d3dw”解码回“1”,您可以确定该函数将为您提供所有值的唯一字符串从 1 到 36^6。
问题实际上是选择这个功能。最简单的方法是将“1”与“000000”、“2”与“000001”、“12”与“00000b”相关联。基本上按字典顺序排列所有可用字符串,并选择等于 id 的位置上的字符串。然而,这确实很容易猜到。所以你可以做的就是人为地改变字典顺序的规则。说不是有一个正常的顺序(0,1,2,3...a,b,c...x,y,z),你可以稍微打乱它并得到类似(a,5,t,3 ...)。这会产生更加混乱的结果。然而,它仍然很容易被猜测,因为第一个元素是“aaaaaa”,第二个元素是“aaaaa5”,然后是“aaaaat”。因此,您可以进一步更改字典顺序规则,使它们取决于字符的位置。说出第一个字符 id (a,5,t,3...) 的顺序,第二个字符 id (y,7,3,r...) 的顺序,等等。
现在,我不会发布任何伪代码,因为它将相当长的一篇。我并不建议你走这条路,除非你有兴趣创建这种有趣的算法:)。然而,如果您选择这条路线,它可能是生成此 id 的一种非常有效的方法,而且不会发生冲突。我建议您阅读 Donald Knuth 博士所著的《计算机编程艺术》第 4 卷。对于此类算法的实现有很多建议。
A bit of a weird idea. Why not to use permutations?
e.g.
you have a set of values [0-9a-z]
when you generate first id. you take a first permutation in lexicographical order. then second and so on.
to make it appear less guessable you can change the rules of lexicographic order. say 'a' goes after 't' or something like that. you can use tuple instead of a full permutations as well.
This will ensure there is no collisions.
What this idea is actually about is making some sort of a two way hash function. basically if you are able to encode number "1" in some way to get something like "q8d3dw" and be able to decode "q8d3dw" back to "1" you can be certain that this function will give you unique strings for all the values from 1 to 36^6.
The problem is actually selecting this function. The easy way would be to associate "1" to "000000", "2" to "000001", "12" to "00000b". Basically arrange all available strings in lexicographical order and select the string on a position which is equal to the id. This however is really easy to guess. So what you could do is to artificially change rules of lexicographic order. Say instead of having a normal order(0,1,2,3...a,b,c...x,y,z) you could shuffle it a bit and get something like (a,5,t,3...). This will produce a bit more obfuscated results. However it will still be quite guessable because the first element would be "aaaaaa", second "aaaaa5", then "aaaaat". So you can change rules of lexicographic order even further, making them depended on the position of the character. Say order for the first char id (a,5,t,3...) for the second one (y,7,3,r...), etc.
Now, i will not post any pseudo code as it will be quite a long one. And I'm not advising you to go with that route unless you are interested in creating this sort of algorithms for fun :). However if you will go with this route it could be a very efficient way of generating this id's with no chance of collision. And I'll advise you to read volume 4 of "Art of computer programming" by Dr Donald Knuth. There are lots of suggestions in implementation of such algorithms.
@ivan:这是一个实现。
首先,您需要 6 个字符串,这是生成它们的代码:
然后您可以在函数中使用该字符串的输出:
我确信我忽略了一些东西,但它似乎正在工作。
@ivan: here is an implementation.
First you need the 6 strings, here's code to generate them:
Then you can use the output from that in the function:
I'm sure I've overlooked something but it seems to be working.
例如大随机数和 SHA-256 哈希?您可以稍后缩短它以满足您的需要。
Big random number and SHA-256 Hash for example? You can shorten it later to fit your needs.
如果站点足够大,我的意思是大 - 就像“我们每秒需要分配数百个新ID”(这会产生其他问题,例如在一年内耗尽36^6密钥空间) ,您可以预先生成密钥并分配它们。
以下是一个伪代码示例 - 在如此大流量的站点中,您可能需要优化所使用的算法。
预生成很简单 - 只需从
000000
开始,一直到zzzzzz
,将每一行插入表中并将其标记为未使用:当您收到新的请求时ID,随机选择一个并将其标记为已使用(危险!并发问题!):
您将遇到效率问题(例如使用
WHERE RAND()
、表锁定等),但它是可行的。If the site is sufficiently large, and I mean large - as in "we need hundreds of new IDs assigned per second" (which will have other problems, e.g. exhaust the 36^6 keyspace under a year), you could pre-generate the keys and assign them.
The following is a pseudocode example - in such a large-traffic site, you'd probably need to optimize the algorithms used.
Pre-generating is trivial - just start at
000000
and go all the way tozzzzzz
, insert each row into a table and mark them unused:When you get a request for new ID, select a random one and mark it used (danger! concurrency issues!):
You'll run into efficiency issues (e.g. with
WHERE RAND()
, table locking, et al.), but it is doable.如果您不反对引入 .NET dll,我已经创建了一个项目来完成此任务。来源是 在 GitHub 上,二进制文件位于 IdGenerator NuGet 包。
它为您提供用户指定长度的有序序列和/或随机值,全部以 36 为基数。即使有多个 ID 生成器实例或在分布式环境中,ID 也是普遍唯一的。
当然,如果您对不可猜测的字符串比对有序序列更感兴趣,则可以使用 GetRandomString 方法:
其背后的基本原理是:
0
填充左侧到所需长度Base36 编码器本身来自 http://www.stum.de/2008 /10/20/base36-encoderdecoder-in-c/。
If you're not averse to bringing in a .NET dll, I've created a project to do exactly this. Source is on GitHub here, and binaries are in the IdGenerator NuGet Package.
It gives you ordered sequences and/or random values of user-specified length, all in base-36. Ids are universally unique, even with multiple id generator instances or in a distributed environment.
Of course if you're more interested in the string not being guessable than in having an ordered sequence, you could just use the GetRandomString method:
The basic principle behind it is:
0
to desired lengthThe Base36 encoder itself is from http://www.stum.de/2008/10/20/base36-encoderdecoder-in-c/.