Compressing large integers into the smallest possible string
I have a bunch of 10-digit integers that I'm passing in a URL, something like "4294965286" and "2292964213". They will always be positive and always be 10 digits long.
I'd like to compress those integers into the smallest possible form that can still be used in a URL (letters and numbers are perfectly fine) and then uncompress them later. I've looked at using GZipStream, but it creates larger strings, not shorter ones.
I'm currently using ASP.NET, so a VB.NET or C# solution would be best.
Thanks
Yes. GZIP is a compression algorithm which both requires compressible data and has an overhead (framing and dictionaries, etc). An encoding algorithm should be used instead.
The "simple" method is to use base-64 encoding.
That is, convert the number (which is represented in base 10 in the string) to the actual series of bytes that represent the number (5 bytes will cover a 10-digit decimal number) and then base-64 encode that result. Each base-64 character stores 6 bits of information, compared with ~3.3 bits per decimal character, so the result is slightly more than half the original size (in this case, 6* base-64 output characters are required).
Additionally, since the input/output lengths are obtainable from the data itself, "123" might be originally (before being base-64 encoded) converted as 1 byte, "30000" as 2 bytes, etc. This would be advantageous if not all the numbers are approximately the same length.
Happy coding.
* Using base-64 requires 6 output characters.
Edit: I was initially wrong when I said "2.3 bits/char" for decimal and proposed that fewer than half the characters were required. I have updated the answer above and show the (hopefully correct) math here, where lg(n) is the logarithm to base 2.
The number of input bits required to represent the input number is bits/char * chars -> lg(10) * 10 (or just lg(9999999999)) -> ~33.2 bits. Using jball's manipulation to shift the number first, the number of bits required is lg(8999999999) -> ~33.06 bits. However, this transformation does not increase the efficiency in this particular case (the number of input bits would need to be reduced to 30 or below to make a difference here).
So we try to find an x (the number of characters in the base-64 encoding) such that:
lg(64) * x = 33.2 -> 6 * x = 33.2 -> x ~ 5.53
Of course, five and a half characters is nonsensical, so we choose 6 as the maximum number of characters required to encode a value up to 9999999999 in base-64 encoding. This is slightly more than half of the original 10 characters.
However, it should be noted that obtaining only 6 characters of base-64 output requires a non-standard base-64 encoder or a little bit of manipulation (most base-64 encoders only work on whole bytes). This works because, of the original 5 "required bytes", only 34 of the 40 bits are used (the top 6 bits are always 0). It would require 7 base-64 characters to encode all 40 bits.
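The character counts discussed here can be checked with integer arithmetic rather than logarithms. The following is a minimal sketch, not part of the original answer; the class name `DigitCount` and the method `DigitsNeeded` are hypothetical names chosen for illustration.

```csharp
using System;

public static class DigitCount
{
    // Number of base-`radix` digits needed to write every value in 0..maxValue.
    public static int DigitsNeeded(ulong maxValue, ulong radix)
    {
        if (radix < 2)
            throw new ArgumentOutOfRangeException(nameof(radix));

        int digits = 1;
        while (maxValue >= radix)
        {
            maxValue /= radix;   // strip one digit per iteration
            digits++;
        }
        return digits;
    }
}
```

For the largest 10-digit value, `DigitCount.DigitsNeeded(9999999999UL, 64)` returns 6, while a radix of 32 gives 7 and a radix of 10 gives 10.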
Here is a modification of the code that Guffa posted in his answer (if you like it, go give him an up-vote) that only requires 6 base-64 characters. Please see other notes in Guffa's answer and Base64 for URL applications as the method below does not use a URL-friendly mapping.
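A minimal sketch of the six-character idea (not the answerer's original listing): it packs the low 36 bits of the value directly into six characters of the standard base-64 alphabet, most significant group first, so no padding is produced. The class name `SixCharBase64` and its method names are hypothetical, and the standard `+` and `/` symbols are kept, so the output is not URL-friendly as it stands.

```csharp
using System;

public static class SixCharBase64
{
    // Standard base-64 alphabet (includes '+' and '/', which are not URL-safe).
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    public static string Encode(ulong value)
    {
        // Six characters hold 6 * 6 = 36 bits; a 10-digit number needs 34.
        if (value >= (1UL << 36))
            throw new ArgumentOutOfRangeException(nameof(value));

        var chars = new char[6];
        for (int i = 5; i >= 0; i--)
        {
            chars[i] = Alphabet[(int)(value & 0x3F)];  // lowest 6 bits
            value >>= 6;
        }
        return new string(chars);
    }

    public static ulong Decode(string text)
    {
        if (text == null || text.Length != 6)
            throw new ArgumentException("Expected exactly 6 characters.", nameof(text));

        ulong value = 0;
        foreach (char c in text)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0)
                throw new FormatException("Invalid base-64 character: " + c);
            value = (value << 6) | (ulong)digit;
        }
        return value;
    }
}
```

With this sketch, `SixCharBase64.Encode(4294965286UL)` yields `D///gm`, and `Decode` turns that string back into the original number.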
Making it "prettier"
Since base-64 has been determined to use 6 characters, any encoding variant that still encodes the input bits into 6 characters will create just as small an output. A base-32 encoding won't quite make the cut, because 6 base-32 characters can only store 30 bits of information (lg(32) * 6).
However, the same output size can be achieved with a custom base-48 (or 52/62) encoding. (The advantage of a base between 48 and 62 is that it only requires a subset of the alphanumeric characters and does not need symbols; optionally, "ambiguous" characters such as 1 and "I" can be left out of the alphabet.) With a base-48 system, 6 characters can encode ~33.5 bits of information (lg(48) * 6), which is just above the ~33.2 (or ~33.06) bits (lg(10) * 10) required.
Here is a proof-of-concept:
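As a minimal sketch of the idea (not the answerer's original listing), a fixed-width base-48 encoder could look like the following. The class name `Base48` and the particular 48-character alphabet are assumptions made for illustration; any 48 distinct URL-safe characters would work the same way.

```csharp
using System;

public static class Base48
{
    // 48 alphanumeric characters. The digits 0 and 1, upper-case I and O and
    // lower-case l and o are left out as look-alikes; lower-case s-z are
    // simply unused. This alphabet is an illustrative assumption.
    public const string Alphabet =
        "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnpqr";

    private const int Width = 6;                       // fixed output width
    private const ulong Radix = 48;
    private const ulong MaxExclusive = 12230590464UL;  // 48^6

    public static string Encode(ulong value)
    {
        if (value >= MaxExclusive)
            throw new ArgumentOutOfRangeException(nameof(value));

        var chars = new char[Width];
        for (int i = Width - 1; i >= 0; i--)
        {
            chars[i] = Alphabet[(int)(value % Radix)];  // least significant digit
            value /= Radix;
        }
        return new string(chars);
    }

    public static ulong Decode(string text)
    {
        if (text == null || text.Length != Width)
            throw new ArgumentException("Expected exactly 6 characters.", nameof(text));

        ulong value = 0;
        foreach (char c in text)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0)
                throw new FormatException("Invalid base-48 character: " + c);
            value = value * Radix + (ulong)digit;
        }
        return value;
    }
}
```

Every 10-digit number fits because 48^6 = 12230590464 exceeds 9999999999; `Base48.Decode(Base48.Encode(n))` returns `n` for any value in that range.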
The above considers the case where the numbers are "random and opaque"; that is, there is nothing that can be determined about the internals of the number. However, if there is a defined structure (e.g. 7th, 8th, and 9th bits are always zero and 2nd and 15th bits are always the same) then -- if and only if 4 or more bits of information can be eliminated from the input -- only 5 base-64 characters would be required. The added complexities and reliance upon the structure very likely outweigh any marginal gain.
You could use base64 encoding to reduce the data to seven characters. You need five bytes to represent the number, and those can be encoded into eight characters using base64, but the last character is always the filler =, so it can be removed.
To decode the text, you add the = again, decode it, and read it as a number.
Two of the characters that base64 uses are not suitable for use in a URL, so you can replace them with other characters and then replace them back. The + and / characters could, for example, be replaced by - and _.
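The steps described in this answer can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the answerer's original listing: the class name `SevenCharBase64` is hypothetical, and the five bytes are extracted by hand (least significant first) so the result does not depend on the machine's byte order.

```csharp
using System;

public static class SevenCharBase64
{
    public static string Encode(long value)
    {
        if (value < 0 || value >= (1L << 40))
            throw new ArgumentOutOfRangeException(nameof(value));

        // Five bytes are enough for any 10-digit number.
        var bytes = new byte[5];
        for (int i = 0; i < 5; i++)
            bytes[i] = (byte)((value >> (8 * i)) & 0xFF);

        // 5 bytes -> 7 data characters plus one '=' filler, which is dropped.
        return Convert.ToBase64String(bytes)
            .TrimEnd('=')
            .Replace('+', '-')   // make the text URL-friendly
            .Replace('/', '_');
    }

    public static long Decode(string text)
    {
        if (text == null || text.Length != 7)
            throw new ArgumentException("Expected exactly 7 characters.", nameof(text));

        // Undo the URL-friendly replacements and restore the filler.
        string padded = text.Replace('-', '+').Replace('_', '/') + "=";
        byte[] bytes = Convert.FromBase64String(padded);

        long value = 0;
        for (int i = 0; i < bytes.Length; i++)
            value |= (long)bytes[i] << (8 * i);
        return value;
    }
}
```

With this sketch, `SevenCharBase64.Encode(4294965286L)` yields `Jvj__wA` (which would be `Jvj//wA` before the URL-friendly replacement), and `Decode` returns the original number.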
I think what you're looking for are Hash IDs: http://hashids.org/
They have implementations in many languages, although it looks like C# is not one of them.
I made an example for you in JavaScript: http://codepen.io/codycraven/pen/MbWwQm
Note that the HashIDs libraries protect your hashes from including foul language.
In addition to changing the base of the encoding (pst and I had the same thought around the same time), since all your numbers are 10 decimal digits long, you can subtract the smallest 10-digit number (1,000,000,000, i.e. 1E9) from each number before you encode it and add it back after decoding. This shifts the numbers to be encoded into the range 0 to 8999999999, which can allow for smaller strings after the base change.
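A minimal sketch of that offset step, assuming the inputs really are always 10 digits long; the class name `TenDigitOffset` and its method names are hypothetical. The shifted value would then be passed to whichever base conversion is in use.

```csharp
using System;

public static class TenDigitOffset
{
    private const ulong Min = 1000000000UL;  // smallest 10-digit number
    private const ulong Max = 9999999999UL;  // largest 10-digit number

    // Maps 1000000000..9999999999 onto 0..8999999999 before encoding.
    public static ulong ToOffset(ulong value)
    {
        if (value < Min || value > Max)
            throw new ArgumentOutOfRangeException(nameof(value));
        return value - Min;
    }

    // Restores the original number after decoding.
    public static ulong FromOffset(ulong offset)
    {
        if (offset > Max - Min)
            throw new ArgumentOutOfRangeException(nameof(offset));
        return offset + Min;
    }
}
```

For example, `TenDigitOffset.ToOffset(4294965286UL)` gives 3294965286, and `FromOffset` maps it back.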
What about converting a big number to a formula: So instead of 21312312312 I might use 4^34. Link
I liked @user166390's answer, but I preferred a most-to-least significant format and thought the code could be improved, since the dictionary is unnecessary for encoding and does not need to be regenerated on every decode. I also added an exception and changed the type to ulong, since negative values are not supported.
If somebody has further performance improvements, feel free to post them, for example a better alternative to StringBuilder.
Here is the code as modified by me.
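A minimal sketch along the lines described (not this answerer's original listing): digits are written most significant first, the reverse lookup table is built once in a static field rather than on every decode, the type is `ulong`, and a plain `char` buffer is used as one possible alternative to StringBuilder. The class name `CompactEncoder` and the 62-character alphabet are assumptions for illustration.

```csharp
using System;

public static class CompactEncoder
{
    // Illustrative 62-character alphabet: digits, upper case, lower case.
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Reverse lookup, built once instead of on every Decode call.
    private static readonly int[] DigitOf = BuildLookup();

    private static int[] BuildLookup()
    {
        var table = new int[128];
        for (int i = 0; i < table.Length; i++) table[i] = -1;
        for (int i = 0; i < Alphabet.Length; i++) table[Alphabet[i]] = i;
        return table;
    }

    public static string Encode(ulong value)
    {
        ulong radix = (ulong)Alphabet.Length;
        var buffer = new char[11];   // 62^11 > 2^64, so 11 characters always suffice
        int pos = buffer.Length;
        do
        {
            buffer[--pos] = Alphabet[(int)(value % radix)];  // fill from the right
            value /= radix;
        } while (value != 0);
        return new string(buffer, pos, buffer.Length - pos);
    }

    public static ulong Decode(string text)
    {
        if (string.IsNullOrEmpty(text))
            throw new ArgumentException("Nothing to decode.", nameof(text));

        ulong radix = (ulong)Alphabet.Length;
        ulong value = 0;
        foreach (char c in text)
        {
            int digit = c < DigitOf.Length ? DigitOf[c] : -1;
            if (digit < 0)
                throw new FormatException("Invalid character: " + c);
            value = checked(value * radix + (ulong)digit);  // throws on overflow
        }
        return value;
    }
}
```

Unlike a fixed-width encoder, this variant produces variable-length output: small values yield short strings, and any 10-digit number takes 6 characters.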