Creating your own Tinyurl style uid

Posted on 2024-07-06 20:37:41

I'm writing a small article on humanly readable alternatives to Guids/UIDs, for example those used on TinyURL for the url hashes (which are often printed in magazines, so need to be short).

The simple uid I'm generating is 6 characters long, each either a lowercase letter (a-z) or a digit (0-9).

"According to my calculations captain", that's 6 mutually exclusive events, although calculating the probability of a clash gets a little harder than P(A or B) = P(A) + P(B), because the alphabet includes digits and, as the code below shows, each character is chosen as a digit or a letter with 50/50 odds.

I'm interested in the clash rate, and in whether the code below is a realistic simulation of the clash rate you'd get from generating a hash. On average I get 40-50 clashes per million; however, bear in mind the uid wouldn't be generated a million times at once, but probably only around 10-1000 times a minute.

What is the probability of a clash each time, and can anyone suggest a better way of doing it?

using System;
using System.Collections.Generic;
using System.Text;

static class Program
{
    static Random _random = new Random();

    public static void Main()
    {
        // Intended key size is 6; note that the inner loop below runs 7 times.
        HashSet<string> set = new HashSet<string>();
        int clashes = 0;
        for (int n = 0; n < 1000000; n++)
        {
            StringBuilder builder = new StringBuilder();

            for (int i = 0; i < 7; i++)
            {
                // 50/50 choice between a letter and a digit
                if (_random.NextDouble() > 0.5)
                {
                    // 'a'..'z' (the upper bound of Next is exclusive)
                    builder.Append((char)_random.Next(97, 123));
                }
                else
                {
                    // The upper bound is exclusive, so this yields 0-8, never 9
                    builder.Append(_random.Next(0, 9).ToString());
                }
            }

            string id = builder.ToString();

            // Add returns false when the id is already in the set
            if (!set.Add(id))
            {
                clashes++;
                Console.WriteLine("clash: (" + n + ")" + id);
            }
        }

        Console.WriteLine("Clashes: " + clashes);
        Console.ReadLine();
    }
}

UPDATE:
Here's the resulting article from this question

I really asked two questions here so I was cheating. The answer I was after was rcar's, however Sklivvz's is also the answer to the 2nd part (an alternative). Is it possible to make a custom unique id generator in a database, or would it be client side (which would be 2 possible reads first)?

The general idea I was after was using Ids in databases or other stores that can be used by phone or printed material, not a giant 16 byte guid.

UPDATE 2: I put the formula for two mutually exclusive events above instead of 2 independent ones (as getting an 'a' the first time doesn't mean you can't get an 'a' the second time). Should've been P(A and B) = P(A) x P(B)

Comments (8)

走过海棠暮 2024-07-13 20:37:41

Why do you want to use a random function? I always assumed that tinyurl used a base 62 (0-9A-Za-z) representation of a sequential Id. No clashes and the urls are always as short as possible.

You would have a DB table like

Id  URL
 1  http://google.com
 2  ...
... ...
156 ...
... ...

and the corresponding URLs would be:

http://example.com/1
http://example.com/2
...
http://example.com/2W
...
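
The mapping from a sequential Id to a short code could look roughly like the sketch below. It assumes the digit order given above (0-9, then A-Z, then a-z); the class and method names are made up for illustration, and whether TinyURL really works this way is only the answerer's guess.

```csharp
using System;
using System.Text;

static class Base62
{
    // Assumed digit order: 0-9, then A-Z, then a-z.
    const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Encode a non-negative sequential id as a base-62 string.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException(nameof(id));
        if (id == 0) return "0";

        var sb = new StringBuilder();
        while (id > 0)
        {
            // Take the least significant digit and put it at the front.
            sb.Insert(0, Alphabet[(int)(id % 62)]);
            id /= 62;
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Encode(1));   // 1
        Console.WriteLine(Encode(156)); // 2W
    }
}
```

Because the Id column is unique, the codes are unique too, and the code only grows by one character each time the Id passes a power of 62.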
梦开始←不甜 2024-07-13 20:37:41

Look up the Birthday Paradox, it's the exact problem that you're running into.

The question is: How many people do you need to get together in a room, so that you have a 50% chance of any two people having the same birthdate? The answer may surprise you.
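
To put numbers on this, here is a small sketch that counts how many uniformly random draws are needed before the chance of at least one repeat reaches a target. The helper name is hypothetical, and uniform draws are an assumption: the generator in the question does not pick its 36 characters uniformly, so the second figure is only a rough guide.

```csharp
using System;

static class Birthday
{
    // Smallest number of uniformly random draws from `space` possible values
    // for which the chance of at least one repeat reaches `target`.
    public static long DrawsForCollision(double space, double target)
    {
        double noCollision = 1.0;
        long n = 0;
        while (1.0 - noCollision < target)
        {
            // The next draw must avoid the n values already taken.
            noCollision *= (space - n) / space;
            n++;
        }
        return n;
    }

    static void Main()
    {
        // Classic birthday problem: 365 days, 50% chance of a shared birthday.
        Console.WriteLine(DrawsForCollision(365, 0.5));             // 23

        // Same question for 6 characters drawn uniformly from 36 symbols.
        Console.WriteLine(DrawsForCollision(Math.Pow(36, 6), 0.5));
    }
}
```

The second figure grows with the square root of the number of possible IDs, which is why it is far smaller than the size of the ID space.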

丶情人眼里出诗心の 2024-07-13 20:37:41

The probability of a collision against one specific ID is:

p = ( 0.5 * ( (0.5*1/10) + (0.5*1/26) ) )^6

which is around 1.7×10^-9.

The probability that a new ID collides with at least one of n existing IDs is 1-(1-p)^n, so you'll have roughly a 0.17% chance of a collision for each new insertion after 1 million IDs have been inserted, around 1.7% after 10 million IDs, and around 16% after 100 million.

1000 IDs/minute works out to about 43 million/month, so as Sklivvz pointed out, using some incrementing ID is probably going to be a better way to go in this case.

EDIT:

To explain the math, he's essentially flipping a coin and then picking a digit or a letter, 6 times over. For each position there's a 0.5 probability that the two coin flips agree; when they agree, half the time both picked a digit (a 1/10 chance of matching) and half the time both picked a letter (a 1/26 chance of matching). That happens 6 times independently, so you multiply those probabilities together.
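
The arithmetic above can be checked with a short sketch. The function names are illustrative, and the per-insertion figure treats each comparison against an existing ID as independent, which is an approximation. The last line is an aside that assumes the question's loop as posted, which runs 7 times and draws digits with `Next(0,9)` (0-8 only).

```csharp
using System;

static class ClashMath
{
    // Chance that two independently generated characters match when each is a
    // digit (one of `digits` values) or a letter (one of 26) with 50/50 odds.
    public static double CharMatch(int digits) =>
        0.5 * (0.5 / digits + 0.5 / 26);

    // Chance that two whole IDs of `length` characters match.
    public static double IdMatch(int length, int digits) =>
        Math.Pow(CharMatch(digits), length);

    // Approximate chance that one new ID clashes with any of n existing IDs.
    public static double ClashOnInsert(double p, double n) =>
        1.0 - Math.Pow(1.0 - p, n);

    static void Main()
    {
        double p = IdMatch(6, 10);
        Console.WriteLine(p);                      // about 1.7e-9
        Console.WriteLine(ClashOnInsert(p, 1e6));  // about 0.0017
        Console.WriteLine(ClashOnInsert(p, 1e7));  // about 0.017
        Console.WriteLine(ClashOnInsert(p, 1e8));  // about 0.16

        // Expected number of clashing pairs among 1,000,000 IDs: pairs * p.
        double pairs = 1e6 * (1e6 - 1) / 2;
        Console.WriteLine(pairs * p);              // 6 chars, digits 0-9: roughly 860
        Console.WriteLine(pairs * IdMatch(7, 9));  // 7 chars, digits 0-8: roughly 51
    }
}
```

Under those assumptions, the 7-character variant gives an expected count close to the 40-50 clashes per million reported in the question, while a true 6-character key would clash far more often.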

等往事风中吹 2024-07-13 20:37:41

Some time ago I did exactly this, and I followed the approach Sklivvz mentioned. The whole logic was developed as a SQL Server stored procedure and a couple of UDFs (user-defined functions). The steps were:

  • say that you want to shorten this url: Creating your own Tinyurl style uid
  • Insert the URL in a table
  • Obtain the @@identity value of the last insert (a numeric id)
  • Transform the id into a corresponding alphanumeric value, based on a "domain" of letters and numbers (I actually used this set: "0123456789abcdefghijklmnopqrstuvwxyz")
  • Return that value back, something like 'cc0'

The conversion was implemented through a couple of very short UDFs.

Two conversions called one after the other would return "sequential" values like these:

select dbo.FX_CONV (123456) -- returns "1f5n"

select dbo.FX_CONV (123457) -- returns "1f5o"

If you are interested I can share the UDF's code.
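
Since the UDF code is not shown, the following is only a guess at the conversion step: a plain positional base-36 encoding over the quoted character set, plus the reverse mapping needed to find the row again. The names are invented, and this is not `FX_CONV` itself: plain base 36 turns 123456 into "2n9c" rather than the "1f5n" quoted above, so the original function presumably used a different mapping.

```csharp
using System;
using System.Text;

static class Base36
{
    // The character set quoted in the answer.
    const string Domain = "0123456789abcdefghijklmnopqrstuvwxyz";

    // Numeric id -> short alphanumeric code.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException(nameof(id));
        if (id == 0) return "0";

        var sb = new StringBuilder();
        for (; id > 0; id /= 36)
            sb.Insert(0, Domain[(int)(id % 36)]);
        return sb.ToString();
    }

    // Short code -> numeric id, used to look the row up again.
    public static long Decode(string code)
    {
        long id = 0;
        foreach (char c in code)
        {
            int digit = Domain.IndexOf(c);
            if (digit < 0) throw new FormatException("Unexpected character: " + c);
            id = id * 36 + digit;
        }
        return id;
    }

    static void Main()
    {
        Console.WriteLine(Encode(123456));  // 2n9c
        Console.WriteLine(Encode(123457));  // 2n9d
        Console.WriteLine(Decode("2n9c"));  // 123456
    }
}
```

Consecutive ids still produce consecutive codes, which matches the "sequential" behaviour described above.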

随风而去 2024-07-13 20:37:41

Why not just use a hashing algorithm, and use a hash of the url?

If you are using random numbers, chances are you will get clashes, because they are non-deterministic.

Hashes aren't provably unique, but there is a fairly good chance that the hash of a string will be unique.

Correction

Actually, wait, you want them to be humanly readable... If you put them in hex they are technically humanly readable.

Or you could use an algorithm that converts a hash into a humanly readable string. If the humanly readable string is just a different representation of the hash, it should be as "unique" as the hash, e.g. base 36 of the original hash.
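
One way this idea could be sketched: hash the URL, then write part of the hash in base 36. The choice of SHA-256 and of keeping only the first 8 bytes are assumptions made for this example, not something the answer specifies, and truncating the hash makes clashes more likely than with the full hash.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class UrlHash
{
    const string Domain = "0123456789abcdefghijklmnopqrstuvwxyz";
    const ulong Base = 36;

    // Hash the URL with SHA-256, keep the first 8 bytes as an unsigned
    // 64-bit number, and write that number in base 36.
    public static string ShortCode(string url)
    {
        byte[] hash;
        using (SHA256 sha = SHA256.Create())
        {
            hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
        }

        ulong value = BitConverter.ToUInt64(hash, 0);
        if (value == 0) return "0";

        var sb = new StringBuilder();
        for (; value > 0; value /= Base)
            sb.Insert(0, Domain[(int)(value % Base)]);
        return sb.ToString();
    }

    static void Main()
    {
        // The same URL always gives the same code, at most 13 characters long.
        Console.WriteLine(ShortCode("http://example.com/some/long/path"));
    }
}
```

A 64-bit value needs up to 13 base-36 characters, so this is considerably longer than the 6 characters the question asks for; shortening it further trades length against clash rate.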

星星的軌跡 2024-07-13 20:37:41

I would generate random values representative of the data that you are going to hash, then hash those and check for clashes, rather than trying to simulate with random, manually made hashes. This will give you a better indicator. And you will have more randomness because you will have more to randomize (assuming your data to be hashed is larger :) ).

氛圍 2024-07-13 20:37:41

If you're using 6 characters, a-z and 0-9, that's a total of 36 possible characters. The number of possible strings is thus 36^6, which is 2176782336, so any two given IDs should only clash 1 in 2176782336 times (assuming every character is picked uniformly).

甜是你 2024-07-13 20:37:41

From Wikipedia:

When printing fewer characters is desired, GUIDs are sometimes encoded into a base64 or Ascii85 string. Base64-encoded GUID consists of 22 to 24 characters (depending on padding), for instance:

7QDBkvCA1+B9K/U0vrQx1A
7QDBkvCA1+B9K/U0vrQx1A==

and Ascii85 encoding gives only 20 characters, e. g.:

5:$Hj:Pf\4RLB9%kU\Lj 

So if you are concerned with uniqueness, a base64-encoded GUID gets you somewhat closer to what you want, though it's not 6 characters.

It's best to work in bytes first, then translate those bytes into hexadecimal for display, rather than working with characters directly.
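
As an illustration of the 22-character form, here is a minimal sketch using the standard .NET base64 helpers. The `ShortGuid` name is made up for the example. Note that standard base64 output can contain '+' and '/', which may need escaping in a URL.

```csharp
using System;

static class ShortGuid
{
    // Base64 text for the 16 bytes of a GUID, without the "==" padding.
    public static string Encode(Guid guid) =>
        Convert.ToBase64String(guid.ToByteArray()).TrimEnd('=');

    // Restore the padding and parse the 16 bytes back into a GUID.
    public static Guid Decode(string text) =>
        new Guid(Convert.FromBase64String(text + "=="));

    static void Main()
    {
        Guid original = Guid.NewGuid();
        string shortForm = Encode(original);
        Console.WriteLine(shortForm + " (" + shortForm.Length + " characters)");
        Console.WriteLine(Decode(shortForm) == original);  // True
    }
}
```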
