Can a Guid be a good partition key?
I have to store many gigabytes of data across multiple machines. Each file is uniquely identified by a Guid, and a file can be hosted on only one machine. I was wondering whether I could use the Guid as a partition key to determine which machine should store the data. If so, what would my partition function be?
Otherwise, how could I partition my data so that all the machines get a very similar load?
Thanks!
P.S. I am not using SQL Server, Oracle, or any other DB. This is all in-house code.
P.P.S. The Guids are generated using the .NET function Guid.NewGuid().
Comments (3)
As James said in his comment, you need something that has a good, uniform distribution. Guids are not guaranteed to have this property. I would recommend a hash, even one as simple as a hash of the Guid itself.
A SHA-1 hash has a good distribution. I wouldn't recommend even/odd hashing unless you plan on distributing between only 2 machines.
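A minimal sketch of the hash-based approach, written in Python for brevity rather than .NET. Here `uuid.uuid4()` stands in for `Guid.NewGuid()`, and `machine_for` and `machine_count` are hypothetical names, not part of any existing code. It hashes the 16 GUID bytes with SHA-1 and takes the result modulo the number of machines:

```python
import hashlib
import uuid
from collections import Counter


def machine_for(file_id: uuid.UUID, machine_count: int) -> int:
    """Map a GUID to a machine index in [0, machine_count)."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    # Hash the 16 raw bytes of the GUID with SHA-1.
    digest = hashlib.sha1(file_id.bytes).digest()
    # Interpret the first 8 digest bytes as an unsigned integer.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % machine_count


if __name__ == "__main__":
    # Rough check of how evenly random GUIDs spread over 5 machines.
    counts = Counter(machine_for(uuid.uuid4(), 5) for _ in range(100_000))
    for machine, count in sorted(counts.items()):
        print(machine, count)
```

Every node has to serialize the Guid to bytes in the same way for the mapping to agree. Also note that with a plain modulo, changing the number of machines moves most files to a different index.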
Because GUIDs are random, you could distribute them by storing the odd GUIDs on one machine and the even GUIDs on the other...
This gives a near-equal distribution.
EDIT
Indeed, this will not work when splitting across more than 2 machines, although you could then split again on whether another byte is odd or even.
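The repeated odd/even split could look roughly like the sketch below. It is in Python, with `uuid.UUID` standing in for a .NET Guid, and `machine_by_parity` and `splits` are hypothetical names. It assumes random (version 4) GUIDs, where the lowest bit of every byte is random. Each odd/even test contributes one bit, so this only addresses a power-of-two number of machines:

```python
import uuid


def machine_by_parity(file_id: uuid.UUID, splits: int) -> int:
    """Pick one of 2**splits machines from the parity of the last `splits` bytes."""
    if not 1 <= splits <= 16:
        raise ValueError("splits must be between 1 and 16")
    index = 0
    for byte in file_id.bytes[-splits:]:
        # Each odd/even test on a byte adds one bit to the machine index.
        index = (index << 1) | (byte & 1)
    return index
```

With `splits=1` this is the plain odd/even rule; `splits=2` gives four machines, `splits=3` gives eight, and so on.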
If you want to round-robin your distribution, I would look at a synchronized counter, which you take modulo (%) the number of machines you have, in the classic round-robin manner.
The synchronized counter could be a field in a database, a single web service, a file on the network, etc.: anything that can be incremented every time a file gets placed.
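As a rough illustration of the counter idea, here is an in-process Python sketch in which a lock-protected integer plays the role of the synchronized counter. The class and method names are hypothetical. In a real multi-machine setup the counter would live in one of the shared places mentioned above. Since the machine can no longer be derived from the Guid, the sketch also assumes you record where each file was placed:

```python
import threading
import uuid


class RoundRobinPlacer:
    """Assign files to machines in round-robin order and remember the placement."""

    def __init__(self, machine_count: int) -> None:
        if machine_count < 1:
            raise ValueError("machine_count must be at least 1")
        self._machine_count = machine_count
        self._counter = 0
        self._lock = threading.Lock()
        self._placements = {}  # maps file GUID -> machine index

    def place(self, file_id: uuid.UUID) -> int:
        """Return the machine for a new file and advance the counter."""
        with self._lock:
            machine = self._counter % self._machine_count
            self._counter += 1
            self._placements[file_id] = machine
            return machine

    def locate(self, file_id: uuid.UUID) -> int:
        """Look up the machine a file was placed on (KeyError if unknown)."""
        with self._lock:
            return self._placements[file_id]
```

This balances the number of files per machine exactly, at the cost of a shared counter and a lookup table. The hash-based approach needs neither.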