How do you generate a random unique identifier in a multi-process and multi-threaded environment?
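Every solution I come up with is not thread-safe.

```python
import os
from binascii import hexlify

# Defined as a method on a class; db is a cursor-like wrapper around the database.
def uuid(cls, db):
    u = hexlify(os.urandom(8)).decode('ascii')
    db.execute('SELECT sid FROM sessions WHERE sid=?', (u,))
    if db.fetch():
        u = cls.uuid(db)  # sid already taken: try again with a fresh value
    else:
        db.execute('INSERT INTO sessions (sid) VALUES (?)', (u,))
    return u
```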
Queue is often the best way to synchronize threads in Python. It fits often enough that, when designing a multi-threaded system, your first thought should be "how could I best do this with Queues?". The underlying idea is to dedicate one thread to entirely "own" a shared resource or subsystem, and have all other "worker" threads access that resource only through gets and/or puts on Queues used by the dedicated thread (Queue is intrinsically thread-safe).

Here, the idea is to make an `idqueue` with a length of only 2 (we don't want the id generation to go wild, making a lot of ids beforehand, which wastes memory and exhausts the entropy pool; not sure if 2 is optimal, but the sweet spot is definitely going to be a pretty small integer ;-), so the id generator thread will block when trying to add the third one, and wait until some space opens in the queue. `idgetter` (which could also be defined simply by a top-level assignment, `idgetter = idqueue.get`) will normally find an id already there and waiting (and make space for the next one!). If not, it intrinsically blocks and waits, waking up as soon as the id generator has placed a new id in the queue.
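The answer's original code did not survive in this copy, so what follows is only a minimal sketch of the design it describes. The names `idqueue` and `idgetter` come from the answer; `idmaker` and the use of `os.urandom(8)` (borrowed from the question) are assumptions.

```python
import os
import queue
import threading
from binascii import hexlify

# Bounded queue: the generator thread blocks once two ids are waiting.
idqueue = queue.Queue(maxsize=2)

def idmaker():
    """Runs forever in one dedicated thread, keeping the queue topped up."""
    while True:
        # put() blocks while the queue is full, so at most 2 ids are pre-made.
        idqueue.put(hexlify(os.urandom(8)).decode('ascii'))

def idgetter():
    """Called from any worker thread; blocks only if no id is ready yet."""
    return idqueue.get()

# Daemon thread so it does not keep the interpreter alive at exit.
threading.Thread(target=idmaker, daemon=True).start()
```

Worker threads would simply call `idgetter()`. Note that this only coordinates threads inside one process; it does not by itself check ids against the database or against other processes.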
Your algorithm is OK (thread-safe as long as your DB API module is thread-safe) and is probably the best way to go. It will never give you a duplicate (assuming you have a PRIMARY or UNIQUE key on sid), but there is a negligibly small chance of getting an `IntegrityError` exception on INSERT. Your code doesn't look good, though. It's better to use a loop with a limited number of attempts instead of recursion (which, in case of some error in the code, could become infinite).

You can raise the number of random bytes read to make the chance of failure even smaller. `base64.urlsafe_b64encode()` is your friend if you'd like to make the SID shorter, but then you have to ensure your database uses case-sensitive comparison for this column (MySQL's VARCHAR is not suitable unless you set a binary collation for it, but VARBINARY is OK).
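A rough sketch of the bounded-retry loop described above, not the answerer's own code (which is missing from this copy). It assumes the standard `sqlite3` module, a `sessions` table with `sid` as PRIMARY KEY, and a hypothetical function name `new_sid` with an `attempts` parameter.

```python
import os
import sqlite3
from binascii import hexlify

def new_sid(conn, attempts=10):
    """Return a sid that was free when checked, or raise after `attempts` tries."""
    for _ in range(attempts):
        sid = hexlify(os.urandom(8)).decode('ascii')
        row = conn.execute(
            'SELECT sid FROM sessions WHERE sid=?', (sid,)).fetchone()
        if row is None:
            # Another writer could still insert the same sid between the
            # SELECT and this INSERT; the key constraint then raises
            # sqlite3.IntegrityError instead of storing a duplicate.
            conn.execute('INSERT INTO sessions (sid) VALUES (?)', (sid,))
            conn.commit()
            return sid
    raise RuntimeError('no free sid found after %d attempts' % attempts)

# Example setup (in-memory database, for illustration only).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sessions (sid TEXT PRIMARY KEY)')
```

Calling `new_sid(conn)` twice should yield two different 16-character hex strings, each stored in `sessions`.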
I'm suggesting just a small modification to the accepted answer by Denis. We simply attempt the insert without first checking explicitly for the generated ID. The insert will very rarely fail, so we usually make only one database call instead of two.

This improves efficiency by making fewer database calls, without compromising thread safety (as this is effectively handled by the database engine).
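The insert-first variant could look roughly like this. It is a sketch under assumptions: `sqlite3` as the database, `sid` declared PRIMARY KEY so a duplicate raises `sqlite3.IntegrityError`, and hypothetical names `random_sid` and `insert_sid` (the `make_sid` parameter exists only to make the retry path easy to exercise).

```python
import os
import sqlite3
from binascii import hexlify

def random_sid():
    return hexlify(os.urandom(8)).decode('ascii')

def insert_sid(conn, make_sid=random_sid, attempts=10):
    """Insert a new sid, retrying only when the key constraint rejects it."""
    for _ in range(attempts):
        sid = make_sid()
        try:
            with conn:  # commits on success, rolls back on an exception
                conn.execute('INSERT INTO sessions (sid) VALUES (?)', (sid,))
        except sqlite3.IntegrityError:
            continue  # sid already present: draw another one
        return sid
    raise RuntimeError('no free sid found after %d attempts' % attempts)

# Example setup (in-memory database, for illustration only).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sessions (sid TEXT PRIMARY KEY)')
```

The uniqueness check is delegated entirely to the key constraint, so there is no window between a check and an insert for another thread or process to slip into.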
If you require thread safety, why not put your random number generator in a function that uses a shared lock? If all the threads calling `get_random_number` use the same lock instance, then only one of them at a time can create a random number.

Of course, you have also just created a bottleneck in your application with this solution. There are other solutions depending on your requirements, such as creating blocks of unique identifiers and then consuming them in parallel.
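A minimal sketch of the shared-lock idea. The function name `get_random_number` is from the answer; the module-level `_lock`, the `_issued` set, and the `os.urandom(8)` source are assumptions added so that the lock protects something concrete (the check-then-record step).

```python
import os
import threading
from binascii import hexlify

_lock = threading.Lock()   # one instance shared by every calling thread
_issued = set()            # values already handed out in this process

def get_random_number():
    """Return a hex string never returned before within this process."""
    with _lock:  # only one thread at a time runs the check-and-record step
        while True:
            value = hexlify(os.urandom(8)).decode('ascii')
            if value not in _issued:
                _issued.add(value)
                return value
```

A `threading.Lock` only serializes threads within a single process, so separate processes would still need a shared arbiter such as the database.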
No need to call the database, I'd think. (The original snippet was taken from a linked page.)
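The snippet this answer pointed to is missing here, so the following is an assumption about the kind of database-free approach meant, not a reconstruction of it: Python's standard `uuid` module. The helper name `new_id` is hypothetical.

```python
import uuid

def new_id():
    """Return a random (version 4) UUID as a 32-character hex string."""
    # uuid4() draws 122 random bits, so no shared state or locking is needed.
    return uuid.uuid4().hex
```

Uniqueness here is probabilistic rather than enforced, which is usually acceptable at this size; a UNIQUE key on the column would still catch the improbable clash.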
I'd start with a thread-unique ID and (somehow) concatenate that with a thread-local counter, then feed the result through a cryptographic hash algorithm.
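One way this might be sketched in Python. The choices below are all assumptions: a per-thread random prefix stands in for the "thread-unique ID" (values from `threading.get_ident()` can be recycled after a thread exits), the process id is mixed in to cover forked processes, and SHA-256 is the hash. The name `next_id` is hypothetical.

```python
import hashlib
import itertools
import os
import threading

_local = threading.local()

def next_id():
    """Hash (per-thread prefix, pid, per-thread counter) into a hex id."""
    if not hasattr(_local, 'prefix'):
        _local.prefix = os.urandom(8)       # assumed unique per thread
        _local.counter = itertools.count()  # touched by this thread only
    raw = _local.prefix + b':%d:%d' % (os.getpid(), next(_local.counter))
    return hashlib.sha256(raw).hexdigest()
```

Because the counter is thread-local, no lock is needed; the hash only gives the ids a uniform shape and hides the counter.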
If you absolutely need to verify the uid against the database and avoid race conditions, use transactions.
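As an illustration only (the answer's code is missing from this copy), here is how the check and the insert might be wrapped in one transaction with `sqlite3`. It assumes a connection opened with `isolation_level=None` so that transactions are controlled explicitly, and uses SQLite's `BEGIN IMMEDIATE` to take the write lock before checking; other databases need their own locking or isolation settings. The name `new_sid_in_transaction` is hypothetical.

```python
import os
import sqlite3
from binascii import hexlify

def new_sid_in_transaction(conn, attempts=10):
    """Check and insert a sid inside a single write transaction."""
    for _ in range(attempts):
        sid = hexlify(os.urandom(8)).decode('ascii')
        conn.execute('BEGIN IMMEDIATE')  # acquire the write lock up front
        try:
            taken = conn.execute(
                'SELECT 1 FROM sessions WHERE sid=?', (sid,)).fetchone()
            if taken is None:
                conn.execute('INSERT INTO sessions (sid) VALUES (?)', (sid,))
            conn.execute('COMMIT')
        except Exception:
            conn.execute('ROLLBACK')
            raise
        if taken is None:
            return sid
    raise RuntimeError('no free sid found after %d attempts' % attempts)

# Example setup: explicit transaction control, in-memory database.
conn = sqlite3.connect(':memory:', isolation_level=None)
conn.execute('CREATE TABLE sessions (sid TEXT PRIMARY KEY)')
```

Holding the write lock across the SELECT and the INSERT means no other writer can add the same sid in between.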
Is there not a unique piece of data in each thread? It is difficult for me to imagine two threads with exactly the same data, though I don't discount the possibility.

In the past, when I have done things of this nature, there has usually been something unique about the thread: a user name, a client name, or something similar. The solution for me was to concatenate, for example, the UserName and the current time in milliseconds, then hash that string and take the hex digest of the hash. This gives a nice string that is always the same length.

There is a really remote possibility that two different John Smiths (or whatever) in two threads generate the id in the same millisecond. If that possibility makes you nervous, then the locking route mentioned earlier may be needed.

As was already mentioned, there are existing routines to get a GUID. I personally like fiddling with hash functions, so I have rolled my own in the way described, with success, on large multi-threaded systems.

It is ultimately up to you to decide whether you really have threads with duplicate data. Be sure to choose a good hashing algorithm. I have used md5 successfully, but I have read that it is possible to generate a hash collision with md5, though I have never done it. Lately I have been using sha1.
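The scheme described above might be sketched as follows. The function name `make_id`, the `now_ms` parameter, and the `:` separator are assumptions; the separator is there so that, say, user `john1` at time 23 and user `john` at time 123 do not produce the same input string.

```python
import hashlib
import time

def make_id(user_name, now_ms=None):
    """Hash the user name plus a millisecond timestamp into a hex digest."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # current time in milliseconds
    raw = '%s:%d' % (user_name, now_ms)
    # sha1 as mentioned in the answer; the hex digest is always 40 characters.
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()
```

The same name in the same millisecond yields the same id, which is exactly the remote collision case the answer mentions.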
mkdtemp should be thread-safe, simple and secure.
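The answer's snippet is missing, so this is only a guess at its shape, using `tempfile.mkdtemp` from the standard library. The helper name `unique_dir` and the `sid-` prefix are hypothetical.

```python
import os
import tempfile

def unique_dir(prefix='sid-'):
    """Create a uniquely named directory and return its path."""
    # mkdtemp creates the directory itself, so the name cannot be handed
    # to two callers while the directory exists.
    return tempfile.mkdtemp(prefix=prefix)
```

The directory's base name (`os.path.basename(path)`) can serve as the identifier. The caller is responsible for removing the directory, and the name is only reserved for as long as the directory exists.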