Generating/compressing a unique key
At work I have many users, and each user has a set of files in their home directory. Following some predefined rules, I have given each file a UID (unique identifier) based on the file's content and its creation time. I have now learned that the number of files in a user account cannot exceed about 1 million. The current UID is about 32 characters long. Is there any way to shorten my UIDs to about 6 characters (ideally), or at most 10-12 characters? The current UIDs take up a lot of space in my NoSQL database.
The current UID looks like
timestamp.process_which_created_it.size
EDIT
Let me rephrase the problem. What I actually need is a compression algorithm.
For example:
I have a list of 1,000,000 strings, each unique and each 32 characters long. I need a compression function f such that f(string) = s2, where s2 is 10 characters long and all the s2 strings are unique (no two inputs map to the same s2).
Comments (3)
Sort your UIDs and replace each old UID with a new UID that is simply the index of the old UID in the sorted array.
Simplified pseudocode should look like this:
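As an illustration of the sorted-index idea, here is a minimal Python sketch. It is not the answerer's own pseudocode; the function name `build_index_map` and the sample UID strings are hypothetical, chosen only to resemble the `timestamp.process.size` format from the question.

```python
def build_index_map(uids):
    """Map each old UID to its position in the sorted list of unique UIDs."""
    ordered = sorted(set(uids))  # drop duplicates, then sort lexicographically
    return {uid: index for index, uid in enumerate(ordered)}


# Hypothetical sample UIDs, for illustration only.
old_uids = [
    "1325400000.proc_b.2048",
    "1325300000.proc_a.1024",
    "1325500000.proc_c.512",
]

short_ids = build_index_map(old_uids)

# Each old UID now has a small integer that can stand in for it.
for uid, index in sorted(short_ids.items(), key=lambda item: item[1]):
    print(index, uid)
```

With at most 1,000,000 files, the indices run from 0 to 999,999, so each fits in 6 decimal digits. One caveat of this approach: the mapping has to be stored, and adding a UID that sorts before existing ones shifts their indices if the map is rebuilt, so it suits a fixed set better than a growing one.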
It is very difficult to take a unique ID, compress it, and keep it unique. You tend to run into collisions.
@amit's suggestion really is the best one, though his implementation was perhaps a bit glib.
How about creating a table with an auto-incrementing integer "ID" column and a string/varchar "OldGUID" column? INSERT all your old/current GUIDs into the table, and you have a 1-to-1 match between each GUID and a shorter "ID". As you create new GUIDs, just INSERT them into the table; the 1-to-1 match continues to hold, so you can switch back and forth between the long and short versions.
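The lookup-table idea can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the question mentions a NoSQL store, so SQLite stands in for whatever database is actually used, and the table name `uid_map` and the helpers `short_id_for` and `old_guid_for` are hypothetical.

```python
import sqlite3

# In-memory database for illustration; a real deployment would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uid_map ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " old_guid TEXT NOT NULL UNIQUE)"
)


def short_id_for(conn, old_guid):
    """Return the short integer ID for old_guid, inserting it if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO uid_map (old_guid) VALUES (?)", (old_guid,)
    )
    row = conn.execute(
        "SELECT id FROM uid_map WHERE old_guid = ?", (old_guid,)
    ).fetchone()
    return row[0]


def old_guid_for(conn, short_id):
    """Return the original GUID for a short ID, or None if it is unknown."""
    row = conn.execute(
        "SELECT old_guid FROM uid_map WHERE id = ?", (short_id,)
    ).fetchone()
    return row[0] if row else None


first = short_id_for(conn, "1325300000.proc_a.1024")
second = short_id_for(conn, "1325400000.proc_b.2048")
print(first, second, old_guid_for(conn, first))
```

The `UNIQUE` constraint on `old_guid` is what keeps the match 1-to-1: asking for the same GUID twice returns the same integer instead of creating a second row.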
If you only need a unique identifier, then my first thought is a UUID.
However, a generic UUID takes 16 bytes and is in binary format. It does not meet your requirement of 6 characters. Compared to your current method using 32 characters, it "only" saves 50% of the space.
Therefore, a milder scheme would be to use a 64-bit UID (8 bytes) produced by a general-purpose hash function. With a good hash, the probability of collision remains fairly reasonable as long as the total number of UIDs generated is below 100 million. If that seems acceptable, then 8 bytes seems pretty close to your space requirement.
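To make the 64-bit hash suggestion concrete, here is a rough Python sketch. The answer does not name a hash function, so the use of BLAKE2b truncated to 8 bytes is an assumption, as are the helper names `uid64`, `uid64_text`, and `collision_probability`. The probability uses the standard birthday approximation and assumes the hash output behaves like uniformly random values.

```python
import base64
import hashlib
import math


def uid64(old_uid: str) -> bytes:
    """Hash an old UID down to 8 bytes (64 bits) with BLAKE2b (an assumed choice)."""
    return hashlib.blake2b(old_uid.encode("utf-8"), digest_size=8).digest()


def uid64_text(old_uid: str) -> str:
    """Render the 8-byte hash as 11 URL-safe base64 characters (padding stripped)."""
    return base64.urlsafe_b64encode(uid64(old_uid)).rstrip(b"=").decode("ascii")


def collision_probability(n: int, bits: int = 64) -> float:
    """Birthday approximation: P(at least one collision) among n random values."""
    return -math.expm1(-n * (n - 1) / (2 * 2**bits))


print(uid64_text("1325300000.proc_a.1024"))
print(collision_probability(10**6))  # 1 million UIDs
print(collision_probability(10**8))  # 100 million UIDs
```

Under these assumptions the chance of any collision is about 2.7e-8 for 1 million UIDs and about 2.7e-4 for 100 million. Unlike a lookup table, a hash cannot be reversed to recover the original UID, and uniqueness is only probabilistic, not guaranteed.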