我可以让 uuid 更随机吗?

发布于 2024-11-18 05:24:53 字数 2781 浏览 3 评论 0原文

我有一个程序可以将消息分派到单独的进程。我需要平衡负载,但不是以非常精确的方式,几乎相同的数字就可以了。由于每条消息都有一个 uuid 字段,因此我想通过 uuid 值来完成。在我测试了 uuid 随机性后,我发现它并不像我想象的那样随机。我的最后一个和第一个大约有 80% 的差异。这是不可接受的,所以我想知道是否有一种算法可以使其更加随机。

这是我的测试代码。

import uuid
from collections import Counter

COUNT = 3000

def b(length):
    holder = []
    for i in xrange(COUNT):
        holder.append(str(uuid.uuid4())[:length])
    return Counter(holder)

def num(part_count):
    sep = 0xffffffffffffffffffffffffffffffff / part_count
    parts = []
    for i in xrange(COUNT):
#        str_hex = str(uuid.uuid4())[:4]
        num = int(uuid.uuid4().hex,16)
        divide = num/sep
        if divide == part_count:
            divide = part_count - 1
        parts.append(divide)
    return Counter(parts)

if __name__ == "__main__":
    print num(200) 

我得到这样的输出:

Counter({127L: 29, 198L: 26, 55L: 25, 178L: 24, 184L: 24, 56L: 23, 132L: 23, 143L: 23, 148L: 23, 195L: 23, 16L: 21, 30L: 21, 44L: 21, 53L: 21, 97L: 21, 158L: 21, 185L: 21, 13L: 20, 146L: 20, 149L: 20, 196L: 20, 2L: 19, 11L: 19, 15L: 19, 19L: 19, 46L: 19, 58L: 19, 64L: 19, 68L: 19, 70L: 19, 89L: 19, 112L: 19, 118L: 19, 128L: 19, 144L: 19, 156L: 19, 192L: 19, 27L: 18, 41L: 18, 42L: 18, 51L: 18, 54L: 18, 85L: 18, 87L: 18, 88L: 18, 93L: 18, 94L: 18, 104L: 18, 106L: 18, 115L: 18, 4L: 17, 22L: 17, 45L: 17, 59L: 17, 79L: 17, 81L: 17, 105L: 17, 125L: 17, 138L: 17, 150L: 17, 159L: 17, 167L: 17, 194L: 17, 3L: 16, 18L: 16, 28L: 16, 31L: 16, 33L: 16, 62L: 16, 65L: 16, 83L: 16, 111L: 16, 123L: 16, 126L: 16, 133L: 16, 145L: 16, 147L: 16, 163L: 16, 166L: 16, 183L: 16, 188L: 16, 190L: 16, 5L: 15, 6L: 15, 9L: 15, 23L: 15, 26L: 15, 34L: 15, 35L: 15, 38L: 15, 69L: 15, 73L: 15, 74L: 15, 77L: 15, 82L: 15, 86L: 15, 107L: 15, 108L: 15, 109L: 15, 110L: 15, 114L: 15, 136L: 15, 141L: 15, 142L: 15, 153L: 15, 160L: 15, 169L: 15, 176L: 15, 180L: 15, 186L: 15, 0L: 14, 1L: 14, 36L: 14, 39L: 14, 43L: 14, 60L: 14, 71L: 14, 72L: 14, 76L: 14, 92L: 14, 113L: 14, 131L: 14, 135L: 14, 157L: 14, 171L: 14, 172L: 14, 181L: 14, 189L: 14, 7L: 13, 17L: 13, 20L: 13, 24L: 13, 25L: 13, 32L: 13, 47L: 13, 49L: 13, 101L: 13, 102L: 13, 117L: 13, 121L: 13, 122L: 13, 124L: 13, 130L: 13, 151L: 13, 152L: 13, 165L: 13, 179L: 13, 14L: 12, 21L: 12, 29L: 12, 50L: 12, 63L: 12, 67L: 12, 80L: 12, 84L: 12, 90L: 12, 91L: 12, 96L: 12, 120L: 12, 129L: 12, 139L: 12, 140L: 12, 182L: 12, 193L: 12, 197L: 12, 52L: 11, 75L: 11, 78L: 11, 103L: 11, 116L: 11, 119L: 11, 134L: 11, 137L: 11, 161L: 11, 173L: 11, 12L: 10, 37L: 10, 66L: 10, 98L: 10, 100L: 10, 162L: 10, 170L: 10, 175L: 10, 177L: 10, 187L: 10, 191L: 10, 199L: 10, 48L: 9, 155L: 9, 164L: 9, 174L: 9, 10L: 8, 95L: 8, 99L: 8, 168L: 8, 8L: 7, 40L: 7, 57L: 7, 61L: 7, 154L: 6})

最后一个是 6,第一个是 29,相差近 5 倍

I have a program that dispatches messages to separate processes. I need to balance the load, but not in very precise way, almost the same number is ok. Since every message has an uuid field, I want to do it by uuid value. After I tested the uuid randomness I found it to not be as random as I expexted. I have the last one and the first one about 80% difference. This is unacceptable, so I want to know if there is an algorithm that can make it more random.

Here is my test code.

import uuid
from collections import Counter

COUNT = 3000

def b(length):
    holder = []
    for i in xrange(COUNT):
        holder.append(str(uuid.uuid4())[:length])
    return Counter(holder)

def num(part_count):
    sep = 0xffffffffffffffffffffffffffffffff / part_count
    parts = []
    for i in xrange(COUNT):
#        str_hex = str(uuid.uuid4())[:4]
        num = int(uuid.uuid4().hex,16)
        divide = num/sep
        if divide == part_count:
            divide = part_count - 1
        parts.append(divide)
    return Counter(parts)

if __name__ == "__main__":
    print num(200) 

and I get the output like this:

Counter({127L: 29, 198L: 26, 55L: 25, 178L: 24, 184L: 24, 56L: 23, 132L: 23, 143L: 23, 148L: 23, 195L: 23, 16L: 21, 30L: 21, 44L: 21, 53L: 21, 97L: 21, 158L: 21, 185L: 21, 13L: 20, 146L: 20, 149L: 20, 196L: 20, 2L: 19, 11L: 19, 15L: 19, 19L: 19, 46L: 19, 58L: 19, 64L: 19, 68L: 19, 70L: 19, 89L: 19, 112L: 19, 118L: 19, 128L: 19, 144L: 19, 156L: 19, 192L: 19, 27L: 18, 41L: 18, 42L: 18, 51L: 18, 54L: 18, 85L: 18, 87L: 18, 88L: 18, 93L: 18, 94L: 18, 104L: 18, 106L: 18, 115L: 18, 4L: 17, 22L: 17, 45L: 17, 59L: 17, 79L: 17, 81L: 17, 105L: 17, 125L: 17, 138L: 17, 150L: 17, 159L: 17, 167L: 17, 194L: 17, 3L: 16, 18L: 16, 28L: 16, 31L: 16, 33L: 16, 62L: 16, 65L: 16, 83L: 16, 111L: 16, 123L: 16, 126L: 16, 133L: 16, 145L: 16, 147L: 16, 163L: 16, 166L: 16, 183L: 16, 188L: 16, 190L: 16, 5L: 15, 6L: 15, 9L: 15, 23L: 15, 26L: 15, 34L: 15, 35L: 15, 38L: 15, 69L: 15, 73L: 15, 74L: 15, 77L: 15, 82L: 15, 86L: 15, 107L: 15, 108L: 15, 109L: 15, 110L: 15, 114L: 15, 136L: 15, 141L: 15, 142L: 15, 153L: 15, 160L: 15, 169L: 15, 176L: 15, 180L: 15, 186L: 15, 0L: 14, 1L: 14, 36L: 14, 39L: 14, 43L: 14, 60L: 14, 71L: 14, 72L: 14, 76L: 14, 92L: 14, 113L: 14, 131L: 14, 135L: 14, 157L: 14, 171L: 14, 172L: 14, 181L: 14, 189L: 14, 7L: 13, 17L: 13, 20L: 13, 24L: 13, 25L: 13, 32L: 13, 47L: 13, 49L: 13, 101L: 13, 102L: 13, 117L: 13, 121L: 13, 122L: 13, 124L: 13, 130L: 13, 151L: 13, 152L: 13, 165L: 13, 179L: 13, 14L: 12, 21L: 12, 29L: 12, 50L: 12, 63L: 12, 67L: 12, 80L: 12, 84L: 12, 90L: 12, 91L: 12, 96L: 12, 120L: 12, 129L: 12, 139L: 12, 140L: 12, 182L: 12, 193L: 12, 197L: 12, 52L: 11, 75L: 11, 78L: 11, 103L: 11, 116L: 11, 119L: 11, 134L: 11, 137L: 11, 161L: 11, 173L: 11, 12L: 10, 37L: 10, 66L: 10, 98L: 10, 100L: 10, 162L: 10, 170L: 10, 175L: 10, 177L: 10, 187L: 10, 191L: 10, 199L: 10, 48L: 9, 155L: 9, 164L: 9, 174L: 9, 10L: 8, 95L: 8, 99L: 8, 168L: 8, 8L: 7, 40L: 7, 57L: 7, 61L: 7, 154L: 6})

the last one is 6 the first one is 29, nearly 5 times difference

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

盛夏已如深秋| 2024-11-25 05:24:53

UUID 并不是随机的,而是唯一的。如果您的平衡器需要关闭它们,它应该首先通过哈希函数运行它们以获得您想要的随机性:

import hashlib
actually_random = hashlib.sha1(uuid).digest()

UUIDs are not meant to be random, just unique. If your balancer needs to be keyed off of them, it should run them through a hash function first to get the randomness you want:

import hashlib
actually_random = hashlib.sha1(uuid).digest()
人生百味 2024-11-25 05:24:53

您的测试方法没有任何意义(见下文)。但首先,这是 uuid4 的实现:

def uuid4():
    """Generate a random UUID."""

    # When the system provides a version-4 UUID generator, use it.
    if _uuid_generate_random:
        _buffer = ctypes.create_string_buffer(16)
        _uuid_generate_random(_buffer)
        return UUID(bytes=_buffer.raw)

    # Otherwise, get randomness from urandom or the 'random' module.
    try:
        import os
        return UUID(bytes=os.urandom(16), version=4)
    except:
        import random
        bytes = [chr(random.randrange(256)) for i in range(16)]
        return UUID(bytes=bytes, version=4)

以及 libuuid 返回的随机性(ctypes 调用),os.urandom< /code> 和 random.randrange 对于大多数非加密内容来说应该足够好了。

编辑:好的,我猜测为什么你的测试方法被破坏了:你正在计算的数字()有两个方面的偏差:首先,它是以下结果除以一个不是 2 的幂的数字(在本例中为 200),这会引入模偏差。其次,ifdivide==part_count:divide=part_count-1引入了更多偏差。

此外,在解释结果之前,您需要弄清楚任何随机数生成器测试的置信区间是多少。不过,我的 stats-foo 在这里不太好,所以我无法真正帮助你......

Your testing methodology doesn't make any sense (see below). But first, this is the implementation of uuid4:

def uuid4():
    """Generate a random UUID."""

    # When the system provides a version-4 UUID generator, use it.
    if _uuid_generate_random:
        _buffer = ctypes.create_string_buffer(16)
        _uuid_generate_random(_buffer)
        return UUID(bytes=_buffer.raw)

    # Otherwise, get randomness from urandom or the 'random' module.
    try:
        import os
        return UUID(bytes=os.urandom(16), version=4)
    except:
        import random
        bytes = [chr(random.randrange(256)) for i in range(16)]
        return UUID(bytes=bytes, version=4)

And the randomness returned by libuuid (the ctypes call), os.urandom and random.randrange should be good enough for most non-crypto stuff.

Edit: Ok, my guess as to why your testing methodology is broken: the number you're counting (divide) is biased in two ways: first, it's the result of dividing by a number which isn't a power of two (in this case, 200), which introduces modulo bias. Second, the if divide == part_count: divide = part_count - 1 introduces more bias.

Additionally, you'll need to figure out what the confidence interval is for any random number generator test before you can interpret the results. My stats-foo isn't great here, though, so I can't really help you with that…

旧情勿念 2024-11-25 05:24:53

嗯,UUID 不应该是随机的,它应该是唯一的:通常,它基于计算机名称/IP、日期等诸如此类的东西:目标不是使其随机,目标是确保两个连续的调用将提供两个不同的值,并且来自不同计算机的 Id 不会发生冲突。如果您想了解更多详细信息,可以查看官方规范(RFC 4122)

现在,如果您的负载平衡器想使用它作为平衡的标准,我认为您的设计是有缺陷的。如果你想要更好的随机性,你可以对其进行散列(如 sha-256),从而稀释所有位之间的小随机性(这就是散列所做的)

Well, UUID is not supposed to be random, it's supposed to be unique : usually, it's based on computer name/ip, date, stuff like that : the goal is not to make it random, the goal is to make sure that two successive calls will provide two different values and that Id from different computers won't collide. If you want more details, you can look at official spec (RFC 4122)

Now, if your load balancer want to use that as a criteria for balancing, I think your design is flawed. If you want a better randomness out of it, you can hash it (like sha-256), thus diluting the little randomness amongst all the bits (that's what a hash is doing)

苹果你个爱泡泡 2024-11-25 05:24:53

仅仅因为某些东西看起来不随机,并不意味着它不是随机的。

也许对于人眼(和头脑)来说,某些序列看起来比其他序列更不随机,但事实并非如此。
当你掷骰子 10 次时,掷出 2-5-1-3-5-1-3-5-2-6 的概率与掷出 1-1-1-1-1-1-1- 的概率一样高1-1-1 或 1-2-3-4-5-6-1-2-3-4。尽管后两个例子似乎不太随机,但事实并非如此。

不要尝试改进随机生成器,因为很可能只会使输出恶化。

例如:您想要生成一个随机序列,但它看起来不够随机,一个字节比另一个字节出现的频率更高。因此,您可以忽略所有具有重复字节(或重复超过 n 次的字节)的序列,以确保更多的随机性。实际上,你正在使你的序列变得不那么随机。

Only because something doesn't look random, doesn't mean it isn't.

Maybe to the human eye (and mind) some sequences look less random than others, they are not.
When you roll a dice 10 times, the probability to roll 2-5-1-3-5-1-3-5-2-6 is as high as rolling 1-1-1-1-1-1-1-1-1-1 or 1-2-3-4-5-6-1-2-3-4. Although the two latter examples seem to be less random, they are not.

Do not try to improve random generators as most probably you will only worsen the output.

For instance: You want to generate a random sequence and it doesn't look random enough to you that one byte appears more frequently than another. Hence you dismiss all sequences with repeated bytes ( or bytes repeated more than n times) in order to assure more randomness. Actually, you are making your sequences less random.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文