Shortening an already short string in Java
I'm looking for a way to shorten an already short string as much as possible.
The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".
I know regular compression is pretty much out of the question for strings this short, due to the overhead needed and the lack of repetition, but I have an idea of how to do it.
Because the alphabet is limited to 39 characters ([a-z][0-9]-:.), every character could fit in 6 bits. This reduces the length by up to 25% compared to 8-bit ASCII. So my suggestion is something along these lines:
- Encode the string to a byte array using some kind of custom encoding
- Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).
And then reverse the process to get the original string.
So to my questions:
- Could this work?
- Is there a better way?
- How?
Comments (6)
You could encode the string as base 40, which is more compact than base 64. This lets you pack 12 such tokens into a 64-bit long. The 40th token could be an end-of-string marker to give you the length (as it will no longer be a whole number of bytes).
If you use arithmetic coding, it could be much smaller, but you would need a table of frequencies for each token (built from a long list of representative examples).
prints
I leave decoding as an exercise. ;)
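The base-40 packing described above can be sketched roughly as follows, including the decoding step. This is only an illustration under assumptions that are not in the original answer: the class name Base40, the token order in ALPHABET, and the choice of token 39 as the end marker are all made up for the example, and uppercase input is simply rejected.

```java
// Hypothetical sketch: packs up to 12 base-40 tokens into each 64-bit long.
final class Base40 {
    // 39 usable characters; the token order is an arbitrary assumption.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-:.";
    private static final int END = 39;          // 40th token: end-of-string marker
    private static final int TOKENS_PER_LONG = 12; // 40^12 < 2^64

    static long[] encode(String s) {
        int tokens = s.length() + 1; // +1 for the end marker
        long[] out = new long[(tokens + TOKENS_PER_LONG - 1) / TOKENS_PER_LONG];
        for (int i = 0; i < out.length * TOKENS_PER_LONG; i++) {
            int digit = END; // positions past the string are filled with END
            if (i < s.length()) {
                digit = ALPHABET.indexOf(s.charAt(i));
                if (digit < 0) {
                    throw new IllegalArgumentException("unsupported character: " + s.charAt(i));
                }
            }
            // The value may exceed Long.MAX_VALUE, but it stays below 2^64,
            // so the wrapped signed result is the correct unsigned value.
            out[i / TOKENS_PER_LONG] = out[i / TOKENS_PER_LONG] * 40 + digit;
        }
        return out;
    }

    static String decode(long[] words) {
        StringBuilder sb = new StringBuilder();
        for (long word : words) {
            int[] digits = new int[TOKENS_PER_LONG];
            // Peel off tokens from the least significant end, using unsigned math.
            for (int i = TOKENS_PER_LONG - 1; i >= 0; i--) {
                digits[i] = (int) Long.remainderUnsigned(word, 40);
                word = Long.divideUnsigned(word, 40);
            }
            for (int d : digits) {
                if (d == END) {
                    return sb.toString();
                }
                sb.append(ALPHABET.charAt(d));
            }
        }
        return sb.toString();
    }
}
```

Under these assumptions, each of the two example strings from the question (17 characters) needs 18 tokens and therefore two longs, that is 16 bytes instead of 17, so the gain over plain ASCII is small at this length.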
First of all, IP addresses are designed to fit into 4 bytes and port numbers into 2. The ASCII representation is only for humans to read, so it doesn't make sense to do compression on that.
Your idea for compressing domain name strings is doable.
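As a rough illustration of the 4 + 2 byte point, an IPv4 "address:port" string could be packed into 6 bytes along these lines. The class name Ipv4Endpoint and its method names are hypothetical, and the sketch assumes dotted-decimal IPv4 input only.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: stores "a.b.c.d:port" as 4 address bytes + 2 port bytes.
final class Ipv4Endpoint {
    static byte[] pack(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing port: " + hostPort);
        }
        String[] octets = hostPort.substring(0, colon).split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + hostPort);
        }
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        if (port < 0 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        ByteBuffer buf = ByteBuffer.allocate(6); // big-endian by default
        for (String octet : octets) {
            int value = Integer.parseInt(octet);
            if (value < 0 || value > 255) {
                throw new IllegalArgumentException("octet out of range: " + octet);
            }
            buf.put((byte) value);
        }
        buf.putShort((short) port);
        return buf.array();
    }

    static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            if (i > 0) {
                sb.append('.');
            }
            sb.append(buf.get() & 0xFF); // mask to read the byte as unsigned
        }
        return sb.append(':').append(buf.getShort() & 0xFFFF).toString();
    }
}
```

With this layout, "123.211.80.4:2122" takes 6 bytes instead of the 17 bytes of its ASCII form.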
Well, in your case I would use a specialized algorithm for your use case. Recognize that you can store something other than strings. So for an IPv4 address:port, you would have a class that captures 6 bytes: 4 for the address and 2 for the port. Another type would handle alphanumeric hostnames. The port would always be stored in two bytes. The hostname part itself could also have specialized support for .com, for example. So a sample hierarchy may include a DotComHostnamePort type, which allows you to drop .com from the host name and save 4 chars/bytes, depending on whether you store hostnames in Punycode form or in UTF-16 form.
The first two bytes could contain the port number. If you always start with this fixed-length port number, you don't need to include the separator ":". Instead, use a bit that indicates whether an IP address follows (see Karl Bielefeldt's solution) or a host name.
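The port-first layout suggested above might look something like the sketch below. Several details are assumptions made for the example, not part of the answer: the class name PortFirstCodec is hypothetical, a whole flag byte is used instead of a single bit for simplicity, and hostnames are stored as plain ASCII (a denser encoding could be substituted).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical layout: [port hi][port lo][flag][4 IPv4 bytes | ASCII hostname]
final class PortFirstCodec {
    static byte[] encode(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing port: " + hostPort);
        }
        String host = hostPort.substring(0, colon);
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        if (port < 0 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(port >>> 8); // write(int) keeps only the low 8 bits
        out.write(port);
        byte[] ip = parseIpv4(host);
        if (ip != null) {
            out.write(1); // flag: an IPv4 address follows
            out.write(ip, 0, ip.length);
        } else {
            byte[] name = host.getBytes(StandardCharsets.US_ASCII);
            out.write(0); // flag: a hostname follows
            out.write(name, 0, name.length);
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        int port = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        String host;
        if (data[2] == 1) {
            host = (data[3] & 0xFF) + "." + (data[4] & 0xFF) + "."
                 + (data[5] & 0xFF) + "." + (data[6] & 0xFF);
        } else {
            host = new String(data, 3, data.length - 3, StandardCharsets.US_ASCII);
        }
        return host + ":" + port; // the ':' is reconstructed, never stored
    }

    // Returns the 4 address bytes, or null if host is not a dotted-decimal IPv4 address.
    private static byte[] parseIpv4(String host) {
        String[] parts = host.split("\\.");
        if (parts.length != 4) {
            return null;
        }
        byte[] ip = new byte[4];
        for (int i = 0; i < 4; i++) {
            // Reject leading zeros so that decoding reproduces the input exactly.
            if (!parts[i].matches("0|[1-9]\\d{0,2}")) {
                return null;
            }
            int value = Integer.parseInt(parts[i]);
            if (value > 255) {
                return null;
            }
            ip[i] = (byte) value;
        }
        return ip;
    }
}
```

In this sketch "123.211.80.4:2122" becomes 7 bytes and "my-domain.se:2121" becomes 15 bytes, compared with 17 bytes each as ASCII.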
You could encode them using the CDC Display code. This encoding was used back in the old days when bits were scarce and programmers were nervous.
What you are suggesting is similar to Base64 encoding/decoding, and there might be some mileage in looking at some of those implementations (Base64 encoding uses 6 bits per character).
As a starter, if you use Apache's Base64 library, it will shorten your string by a few chars. This obviously does not work, as what you end up with is not what you started with.