Can I make a UUID more random?

Posted 2024-11-18 05:24:53 · 2781 characters · 5 views · 0 comments

I have a program that dispatches messages to separate processes. I need to balance the load, but not very precisely; roughly equal numbers are fine. Since every message has a uuid field, I want to balance on the uuid value. After testing the uuid randomness, I found it is not as random as I expected: the last and first entries in the output differ by about 80%. This is unacceptable, so I want to know whether there is an algorithm that can make it more random.

Here is my test code.

import uuid
from collections import Counter

COUNT = 3000

def b(length):
    holder = []
    for i in xrange(COUNT):
        holder.append(str(uuid.uuid4())[:length])
    return Counter(holder)

def num(part_count):
    sep = 0xffffffffffffffffffffffffffffffff / part_count
    parts = []
    for i in xrange(COUNT):
#        str_hex = str(uuid.uuid4())[:4]
        num = int(uuid.uuid4().hex,16)
        divide = num/sep
        if divide == part_count:
            divide = part_count - 1
        parts.append(divide)
    return Counter(parts)

if __name__ == "__main__":
    print num(200)

And I get output like this:

Counter({127L: 29, 198L: 26, 55L: 25, 178L: 24, 184L: 24, 56L: 23, 132L: 23, 143L: 23, 148L: 23, 195L: 23, 16L: 21, 30L: 21, 44L: 21, 53L: 21, 97L: 21, 158L: 21, 185L: 21, 13L: 20, 146L: 20, 149L: 20, 196L: 20, 2L: 19, 11L: 19, 15L: 19, 19L: 19, 46L: 19, 58L: 19, 64L: 19, 68L: 19, 70L: 19, 89L: 19, 112L: 19, 118L: 19, 128L: 19, 144L: 19, 156L: 19, 192L: 19, 27L: 18, 41L: 18, 42L: 18, 51L: 18, 54L: 18, 85L: 18, 87L: 18, 88L: 18, 93L: 18, 94L: 18, 104L: 18, 106L: 18, 115L: 18, 4L: 17, 22L: 17, 45L: 17, 59L: 17, 79L: 17, 81L: 17, 105L: 17, 125L: 17, 138L: 17, 150L: 17, 159L: 17, 167L: 17, 194L: 17, 3L: 16, 18L: 16, 28L: 16, 31L: 16, 33L: 16, 62L: 16, 65L: 16, 83L: 16, 111L: 16, 123L: 16, 126L: 16, 133L: 16, 145L: 16, 147L: 16, 163L: 16, 166L: 16, 183L: 16, 188L: 16, 190L: 16, 5L: 15, 6L: 15, 9L: 15, 23L: 15, 26L: 15, 34L: 15, 35L: 15, 38L: 15, 69L: 15, 73L: 15, 74L: 15, 77L: 15, 82L: 15, 86L: 15, 107L: 15, 108L: 15, 109L: 15, 110L: 15, 114L: 15, 136L: 15, 141L: 15, 142L: 15, 153L: 15, 160L: 15, 169L: 15, 176L: 15, 180L: 15, 186L: 15, 0L: 14, 1L: 14, 36L: 14, 39L: 14, 43L: 14, 60L: 14, 71L: 14, 72L: 14, 76L: 14, 92L: 14, 113L: 14, 131L: 14, 135L: 14, 157L: 14, 171L: 14, 172L: 14, 181L: 14, 189L: 14, 7L: 13, 17L: 13, 20L: 13, 24L: 13, 25L: 13, 32L: 13, 47L: 13, 49L: 13, 101L: 13, 102L: 13, 117L: 13, 121L: 13, 122L: 13, 124L: 13, 130L: 13, 151L: 13, 152L: 13, 165L: 13, 179L: 13, 14L: 12, 21L: 12, 29L: 12, 50L: 12, 63L: 12, 67L: 12, 80L: 12, 84L: 12, 90L: 12, 91L: 12, 96L: 12, 120L: 12, 129L: 12, 139L: 12, 140L: 12, 182L: 12, 193L: 12, 197L: 12, 52L: 11, 75L: 11, 78L: 11, 103L: 11, 116L: 11, 119L: 11, 134L: 11, 137L: 11, 161L: 11, 173L: 11, 12L: 10, 37L: 10, 66L: 10, 98L: 10, 100L: 10, 162L: 10, 170L: 10, 175L: 10, 177L: 10, 187L: 10, 191L: 10, 199L: 10, 48L: 9, 155L: 9, 164L: 9, 174L: 9, 10L: 8, 95L: 8, 99L: 8, 168L: 8, 8L: 7, 40L: 7, 57L: 7, 61L: 7, 154L: 6})

The last one is 6 and the first one is 29, nearly a 5x difference.

盛夏已如深秋| 2024-11-25 05:24:53

UUIDs are not meant to be random, just unique. If your balancer needs to be keyed off of them, it should run them through a hash function first to get the randomness you want:

import hashlib
# `uuid` here is the message's UUID as a byte string, not the uuid module
actually_random = hashlib.sha1(uuid).digest()
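
As a rough sketch of how a dispatcher might key off that hash: the helper name pick_worker and the worker_count parameter are made up for illustration, and reducing the digest modulo the worker count is an assumption about how you would map it to a process, not something from the question. It is written for Python 3.

```python
import hashlib
import uuid


def pick_worker(message_uuid, worker_count):
    """Map a UUID to a worker index in range(worker_count) (hypothetical helper)."""
    # Hash the 16 raw bytes so every bit of the UUID influences the result.
    digest = hashlib.sha1(message_uuid.bytes).digest()
    # Interpret the digest as a big integer and reduce it to a worker index.
    return int.from_bytes(digest, "big") % worker_count


if __name__ == "__main__":
    print(pick_worker(uuid.uuid4(), 8))
```

The same UUID always lands on the same worker, which is usually what you want when balancing on a message field.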

人生百味 2024-11-25 05:24:53

Your testing methodology doesn't make any sense (see below). But first, this is the implementation of uuid4:

def uuid4():
    """Generate a random UUID."""

    # When the system provides a version-4 UUID generator, use it.
    if _uuid_generate_random:
        _buffer = ctypes.create_string_buffer(16)
        _uuid_generate_random(_buffer)
        return UUID(bytes=_buffer.raw)

    # Otherwise, get randomness from urandom or the 'random' module.
    try:
        import os
        return UUID(bytes=os.urandom(16), version=4)
    except:
        import random
        bytes = [chr(random.randrange(256)) for i in range(16)]
        return UUID(bytes=bytes, version=4)

And the randomness returned by libuuid (the ctypes call), os.urandom and random.randrange should be good enough for most non-crypto stuff.

Edit: OK, my guess as to why your testing methodology is broken: the number you're counting (divide) is biased in two ways. First, it's the result of dividing by a number which isn't a power of two (in this case, 200), which introduces modulo bias. Second, the line "if divide == part_count: divide = part_count - 1" introduces more bias.

Additionally, you'll need to figure out what the confidence interval is for any random number generator test before you can interpret the results. My stats-fu isn't great here, though, so I can't really help you with that.
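
For a rough sense of scale, here is a sketch (Python 3) that works out how much spread a perfectly uniform split would show with the question's numbers, 3000 messages into 200 bins. The function names are made up for illustration, and the simulation uses the random module as a stand-in for uuid4. The per-bin mean is 15 and the binomial standard deviation is about 3.9, so the reported extremes of 6 and 29 sit roughly 2.3 and 3.6 standard deviations from the mean, which may well be within what chance produces when you take the extremes over 200 bins.

```python
import math
import random
from collections import Counter

COUNT = 3000       # messages, as in the question
PART_COUNT = 200   # bins, as in the question


def expected_mean_and_std(count, part_count):
    """Mean and standard deviation of one bin's count under a uniform split."""
    p = 1.0 / part_count
    mean = count * p
    std = math.sqrt(count * p * (1 - p))  # binomial standard deviation
    return mean, std


def simulate_spread(count, part_count, seed=0):
    """Min and max bin count when bins are drawn from a uniform generator."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(part_count) for _ in range(count))
    # Bins that were never hit do not appear in the Counter, so count them as 0.
    per_bin = [counts.get(i, 0) for i in range(part_count)]
    return min(per_bin), max(per_bin)


if __name__ == "__main__":
    mean, std = expected_mean_and_std(COUNT, PART_COUNT)
    print("expected per bin: %.1f +/- %.2f" % (mean, std))
    print("simulated min/max:", simulate_spread(COUNT, PART_COUNT))
```

Running the simulation with a few different seeds gives a feel for how far apart the smallest and largest bins typically are at this sample size.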

旧情勿念 2024-11-25 05:24:53

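Well, a UUID isn't supposed to be random, it's supposed to be unique. Typically it is based on things like the computer name/IP, the date, and so on. The goal is not to make it random; the goal is to make sure that two consecutive calls give two different values, and that IDs from different computers don't collide. If you want more details, have a look at the official specification (RFC 4122).

Now, if your load balancer wants to use it as the balancing criterion, I think your design is flawed. If you want better randomness, you could hash it (with SHA-256, for example), which spreads the small amount of randomness across all of the bits (that is what hashing does).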

苹果你个爱泡泡 2024-11-25 05:24:53

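Just because something doesn't look random doesn't mean it isn't random.

To the human eye (and mind), some sequences may look less random than others, but they aren't.
When you roll a die 10 times, rolling 2-5-1-3-5-1-3-5-2-6 is exactly as likely as rolling 1-1-1-1-1-1-1-1-1-1 or 1-2-3-4-5-6-1-2-3-4. The last two examples look less random, but they are not.

Don't try to improve a random generator, because you will most likely only make the output worse.

For example: you want to generate a random sequence, but it doesn't look random enough because one byte shows up more often than another. So you discard every sequence that has repeated bytes (or bytes repeated more than n times) to ensure more randomness. In reality, you are making your sequence less random.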
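
To make that concrete, here is a small sketch (Python 3) that uses three die rolls instead of bytes so every sequence can be enumerated. The "no immediate repeats" filter is a made-up stand-in for the kind of improvement described above. Of the 216 possible sequences only 150 survive the filter, so the filtered generator can produce fewer outcomes and carries less entropy.

```python
import math
from itertools import product


def entropy_bits(outcome_count):
    """Entropy in bits of a uniform choice among outcome_count outcomes."""
    return math.log2(outcome_count)


# Every possible sequence of 3 rolls of a six-sided die.
all_sequences = list(product(range(1, 7), repeat=3))

# Hypothetical "improvement": reject any sequence with an immediate repeat.
filtered = [s for s in all_sequences
            if all(a != b for a, b in zip(s, s[1:]))]

if __name__ == "__main__":
    print(len(all_sequences), entropy_bits(len(all_sequences)))
    print(len(filtered), entropy_bits(len(filtered)))
```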
