Python's multiprocessing doesn't play nicely with uuid.uuid4()

Posted 2024-08-30 19:49:13

I'm trying to generate a uuid for a filename, and I'm also using the multiprocessing module. Unpleasantly, all of my uuids end up exactly the same. Here is a small example:

import multiprocessing
import uuid

def get_uuid( a ):
    ## Doesn't help to cycle through a bunch.
    #for i in xrange(10): uuid.uuid4()

    ## Doesn't help to reload the module.
    #reload( uuid )

    ## Doesn't help to load it at the last minute.
    ## (I simultaneously comment out the module-level import).
    #import uuid

    ## uuid1() does work, but it differs only in the first 8 characters and includes identifying information about the computer.
    #return uuid.uuid1()

    return uuid.uuid4()

def main():
    pool = multiprocessing.Pool( 20 )
    uuids = pool.map( get_uuid, range( 20 ) )
    for id in uuids: print id

if __name__ == '__main__': main()

I peeked into uuid.py's code, and depending on the platform it seems to use some OS-level routines for randomness, so I'm stumped as to a Python-level solution (something like reloading the uuid module or choosing a new random seed). I could use uuid.uuid1(), but only 8 digits differ and I think they are derived exclusively from the time, which seems dangerous, especially given that I'm multiprocessing (so the code could be executing at exactly the same time). Is there some wisdom out there about this issue?
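To make the uuid1() concern concrete, here is a small sketch (written for Python 3, unlike the example above) that splits a version-1 UUID into the fields it is built from. The helper name describe_uuid1 is made up for illustration. The first 8 hex characters are the low bits of the timestamp, which is why only they appear to change between calls.

```python
import uuid


def describe_uuid1(u):
    # Split a version-1 UUID into the parts it is built from.
    return {
        "time": u.time,            # 60-bit timestamp
        "clock_seq": u.clock_seq,  # 14-bit sequence number
        "node": u.node,            # 48-bit node ID, often the MAC address
    }


if __name__ == "__main__":
    for _ in range(3):
        u = uuid.uuid1()
        print(u, describe_uuid1(u))
```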

Comments (4)

春花秋月 2024-09-06 19:49:13

This is the correct way to generate your own uuid4, if you need to do that:

import os, uuid

def get_uuid(a):
    return uuid.UUID(bytes=os.urandom(16), version=4)

Python should be doing this automatically--this code is right out of uuid.uuid4, when the native _uuid_generate_random doesn't exist. There must be something wrong with your platform's _uuid_generate_random.

If you have to do this, don't just work around it yourself and let everyone else on your platform suffer; report the bug.
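For anyone who wants to try the workaround under a pool, here is a minimal sketch in Python 3 syntax. The names make_uuid4 and main and the pool size of 4 are illustrative choices, not anything from the question. It builds each UUID from os.urandom inside the workers and counts how many distinct values come back.

```python
import multiprocessing
import os
import uuid


def make_uuid4(_task=None):
    # Build a version-4 UUID directly from 16 bytes of OS randomness,
    # bypassing any platform-native generator.
    return uuid.UUID(bytes=os.urandom(16), version=4)


def main():
    with multiprocessing.Pool(4) as pool:
        ids = pool.map(make_uuid4, range(20))
    # Report how many distinct values the workers produced.
    print(len(set(ids)), "unique UUIDs out of", len(ids))


if __name__ == "__main__":
    main()
```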

爱她像谁 2024-09-06 19:49:13

I don't see a way to make this work either. But you could just generate all the uuids in the main process and pass them to the workers.
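A rough sketch of that idea in Python 3: the parent builds (item, uuid) pairs and the workers only consume them. The function names and the file-name scheme are invented for illustration.

```python
import multiprocessing
import uuid


def process_item(args):
    # Each task receives its UUID from the parent instead of generating one.
    item, file_id = args
    return "{}-{}.tmp".format(item, file_id)  # hypothetical file-name scheme


def build_tasks(items):
    # Runs in the parent process, so all UUIDs come from one generator.
    return [(item, uuid.uuid4()) for item in items]


def main():
    tasks = build_tasks(range(20))
    with multiprocessing.Pool(4) as pool:
        for name in pool.map(process_item, tasks):
            print(name)


if __name__ == "__main__":
    main()
```

UUID objects are picklable, so they can be sent to the workers as ordinary task arguments.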

流心雨 2024-09-06 19:49:13

This works fine for me. Does your Python installation have os.urandom? If not, random number seeding will be very poor and would lead to this problem (assuming there's also no native UUID module, uuid._uuid_generate_random).
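One way to check both conditions on a given installation might look like the sketch below. The helper names are made up, and _uuid_generate_random is a private name that only exists in some older versions of uuid.py, so it is looked up defensively.

```python
import os
import uuid


def urandom_available():
    # os.urandom raises NotImplementedError when no randomness source is
    # found; on very old builds the function may be missing altogether.
    try:
        return len(os.urandom(16)) == 16
    except (AttributeError, NotImplementedError):
        return False


def has_native_generator():
    # Private attribute of older uuid.py versions; absent in newer ones.
    return getattr(uuid, "_uuid_generate_random", None) is not None


if __name__ == "__main__":
    print("os.urandom usable:", urandom_available())
    print("native _uuid_generate_random:", has_native_generator())
```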

念三年u 2024-09-06 19:49:13

Currently, I am working on a script that fetches files either from a zip archive or from disk. After fetching, the payload gets pushed to an external tool via a web API.
For performance reasons, I used the multiprocessing.Pool.map method, and uuid looked quite handy for the tmp file names. But I ran into the same issue you asked about here.

First, please check out the official docs for uuid. There is an attribute called is_safe that indicates whether the uuid was generated in a multiprocessing-safe way. In my case it was not.
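A quick way to inspect this, assuming Python 3.7 or later, where is_safe is available on UUID instances and holds a member of the uuid.SafeUUID enum (safe, unsafe, or unknown). The helper name safety_of is made up.

```python
import uuid


def safety_of(u):
    # is_safe is a uuid.SafeUUID member: safe, unsafe, or unknown.
    return u.is_safe


if __name__ == "__main__":
    print("uuid1:", safety_of(uuid.uuid1()))
    print("uuid4:", safety_of(uuid.uuid4()))
```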

After some research, I finally changed my strategy and moved from uuid to the process pid and name.
Because I only need the uuid for naming tmp files, pid and name work fine as well. We can access the current worker Process instance via multiprocessing.current_process(). If you really need a uuid, you could potentially integrate the worker pid somehow.

In addition, uuid uses system entropy for generation (see the uuid source). Because it does not matter to me how the file is named, this solution also avoids running short of entropy.
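Roughly, the naming scheme could look like the sketch below (Python 3). The function name tmp_name, the task_index argument, and the ".tmp" suffix are my own additions for illustration. The index is there because a pool worker handles many tasks, so pid and name alone would repeat within one worker.

```python
import multiprocessing
import os


def tmp_name(task_index, suffix=".tmp"):
    proc = multiprocessing.current_process()
    # proc.name identifies the worker; os.getpid() is that worker's pid.
    # task_index (an assumption) keeps names distinct across the tasks
    # that one worker handles.
    return "{}-{}-{}{}".format(proc.name, os.getpid(), task_index, suffix)


def main():
    with multiprocessing.Pool(4) as pool:
        for name in pool.map(tmp_name, range(8)):
            print(name)


if __name__ == "__main__":
    main()
```

These names are only unique on one machine and while the process is alive, since pids get reused. For tmp files that seems acceptable.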
