What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)?
The standard string representation of a GUID takes 36 characters (32 hex digits plus 4 hyphens), which is very readable but also really wasteful. I am wondering how to encode it in the shortest possible way using all the ASCII characters in the range 33-127. The naive implementation produces 22 characters, simply because 128 bits / 6 bits per character is about 21.3, which rounds up to 22.
Huffman encoding is my second-best idea; the only question is how to choose the codes.
The encoding must be lossless, of course.
Comments (8)
This is an old question, but I had to solve it in order for a system I was working on to be backward compatible.
The exact requirement was for a client-generated identifier that would be written to the database and stored in a 20-character unique column. It never got shown to the user and was not indexed in any way.
Since I couldn't eliminate the requirement, I really wanted to use a Guid (which is statistically unique) and if I could encode it losslessly into 20 characters, then it would be a good solution given the constraints.
Ascii-85 allows you to encode 4 bytes of binary data into 5 bytes of Ascii data, so a 16-byte Guid just fits into 20 Ascii characters using this encoding scheme. A Guid can have 2^128 ≈ 3.4028e+38 discrete values, whereas 20 characters of Ascii-85 can have 85^20 ≈ 3.8760e+38 discrete values.
When writing to the database I had some concerns about truncation, so no whitespace characters are included in the encoding. I also had issues with collation, which I addressed by excluding lowercase characters from the encoding. Also, it would only ever be passed through a parameterized command, so any special SQL characters would be escaped automatically.
I've included the C# code to perform Ascii-85 encoding and decoding in case it helps anyone out there. Obviously, depending on your usage you might need to choose a different character set as my constraints made me choose some unusual characters like 'ß' and 'Ø' - but that's the easy part:
Also, here are the unit tests. They aren't as thorough as I'd like, and I don't like the non-determinism where Guid.NewGuid() is used, but they should get you started:
I hope this saves somebody some trouble.
Also, if you find any bugs then let me know ;-)
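To illustrate the 4-bytes-to-5-characters packing described in this answer, here is a minimal Python sketch. It is not the C# code the answer refers to. The alphabet (the 85 consecutive ASCII characters from 33 to 117) and the function names are assumptions made for illustration, and it does not apply the answer's lowercase-free character set; swapping in a different 85-character alphabet only means changing ALPHABET.

```python
import struct
import uuid

# Assumed alphabet: 85 consecutive ASCII characters starting at '!' (code 33).
ALPHABET = "".join(chr(33 + i) for i in range(85))
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def guid_to_ascii85(g: uuid.UUID) -> str:
    """Encode the 16 bytes of a GUID as 4 groups of 5 base-85 digits."""
    out = []
    for (word,) in struct.iter_unpack(">I", g.bytes):  # four big-endian 32-bit words
        digits = []
        for _ in range(5):
            word, r = divmod(word, 85)
            digits.append(ALPHABET[r])
        out.extend(reversed(digits))  # most significant digit first
    return "".join(out)


def ascii85_to_guid(s: str) -> uuid.UUID:
    """Decode a 20-character string produced by guid_to_ascii85."""
    if len(s) != 20:
        raise ValueError("expected exactly 20 characters")
    raw = bytearray()
    for i in range(0, 20, 5):
        word = 0
        for c in s[i:i + 5]:
            word = word * 85 + INDEX[c]
        raw += struct.pack(">I", word)  # raises struct.error if the group exceeds 32 bits
    return uuid.UUID(bytes=bytes(raw))
```

A round trip such as ascii85_to_guid(guid_to_ascii85(g)) == g should hold for any GUID, and the encoded form is always exactly 20 characters long.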
Use Base 85.
See section 4.1, "Why 85?", of RFC 1924, A Compact Representation of IPv6 Addresses.
An IPv6 address, like a GUID, is 128 bits long (eight 16-bit pieces).
You have 95 characters available -- so, more than 6 bits per character, but not quite as many as 7 (about 6.57 actually). You need 128/log2(95) = about 19.48 characters, so a GUID encodes into 20 characters. If saving 2 characters in the encoded form is worth the loss of readability to you, something like (pseudocode):
which is basically the generic "encode a number in some base" code, except that there's no need to reverse the "digits" since the order's arbitrary anyway (and little-endian is more direct and natural). Getting the guid back from the encoded string works in a very similar way: it is the polynomial computation in base 95 (after subtracting 33 from each digit, of course):
essentially using Horner's approach to polynomial evaluation.
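The base-95 scheme described in this answer can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the answer's original pseudocode: the names encode95 and decode95, the fixed output length of 20, and the use of Python's uuid module are all assumptions. Note that the 33-127 range taken from the question includes code 127 (DEL), which is a control character, so a practical alphabet might use 32-126 instead by changing OFFSET.

```python
import uuid

BASE = 95     # number of symbols in the alphabet
OFFSET = 33   # first character code, following the question's 33-127 range
LENGTH = 20   # smallest n with 95**n >= 2**128


def encode95(g: uuid.UUID) -> str:
    """Emit base-95 digits of the 128-bit value, least significant first."""
    n = g.int
    chars = []
    for _ in range(LENGTH):
        n, digit = divmod(n, BASE)
        chars.append(chr(OFFSET + digit))
    return "".join(chars)  # little-endian: no reversal needed


def decode95(s: str) -> uuid.UUID:
    """Rebuild the value with Horner's rule, starting from the last digit."""
    if len(s) != LENGTH:
        raise ValueError("expected exactly 20 characters")
    n = 0
    for c in reversed(s):
        digit = ord(c) - OFFSET
        if not 0 <= digit < BASE:
            raise ValueError("character outside the alphabet")
        n = n * BASE + digit
    return uuid.UUID(int=n)  # raises ValueError if n does not fit in 128 bits
```

Always emitting 20 digits keeps the encoding fixed-width, which makes decoding unambiguous at the cost of padding small values with the zero digit.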
Simply go Base64.
Using the full range from 33 (what's wrong with space, incidentally?) to 127 gives you 95 possible characters. Expressing the 2^128 possible values of a guid in base 95 will use 20 characters. This (modulo things like dropping nybbles that will be constant) is the best you can do. Save yourself the trouble - use base 64.
Assuming that all of your GUIDs are being generated by the same algorithm, you can save 4 bits by not encoding the algorithm nibble, before applying any other encoding :-|
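A hedged sketch of that idea in Python: in the RFC 4122 layout the version ("algorithm") nibble sits in bits 76-79 of the 128-bit value, so it can be cut out before encoding and re-inserted after decoding. The function names are hypothetical, and the sketch assumes every GUID in the system really does share one known version. The remaining 124 bits fit into 19 base-95 characters (95^19 > 2^124), but still need 20 characters in base 85.

```python
import uuid

VERSION_SHIFT = 76                      # version nibble occupies bits 76..79
LOW_MASK = (1 << VERSION_SHIFT) - 1     # the 76 bits below the nibble


def strip_version(g: uuid.UUID) -> int:
    """Return the GUID as a 124-bit integer with the version nibble removed."""
    n = g.int
    high = n >> (VERSION_SHIFT + 4)     # the 48 bits above the nibble
    return (high << VERSION_SHIFT) | (n & LOW_MASK)


def restore_version(n124: int, version: int) -> uuid.UUID:
    """Re-insert a known version nibble into a 124-bit integer."""
    high = n124 >> VERSION_SHIFT
    low = n124 & LOW_MASK
    return uuid.UUID(int=(high << (VERSION_SHIFT + 4)) | (version << VERSION_SHIFT) | low)
```

The 124-bit integer can then be passed to whichever base-N encoder is in use.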
An arbitrary GUID? The "naive" algorithm will produce optimal results. The only way to compress a GUID further is to make use of patterns in the data excluded by your "arbitrary" constraint.
I agree with the Base64 approach. It will cut a 32-hex-digit UUID down to 22 Base64 characters (24 with the two '=' padding characters, which can be dropped).
Here are simple Hex <-> Base64 converting functions for PHP:
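For comparison, the same conversion can be sketched in Python with the standard library. This is not the PHP code the answer refers to; the function names and the choice of the URL-safe Base64 alphabet are assumptions made for illustration.

```python
import base64
import uuid


def uuid_to_b64(g: uuid.UUID) -> str:
    """Encode the 16 raw bytes as Base64 and drop the two '=' padding characters."""
    return base64.urlsafe_b64encode(g.bytes).rstrip(b"=").decode("ascii")


def b64_to_uuid(s: str) -> uuid.UUID:
    """Restore the padding and decode back to a UUID."""
    if len(s) != 22:
        raise ValueError("expected exactly 22 characters")
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))
```

Starting from a hex string, uuid.UUID(hex_string) accepts the 32-digit form directly, so the hex to Base64 direction is uuid_to_b64(uuid.UUID(hex_string)) and the reverse is b64_to_uuid(s).hex.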