Encode a string to another base with more characters?
I know that I can encode numbers in a base like 65 to reduce the number of characters needed to display them (even if the number is smaller in binary).
However, is there a way to encode UTF-8 text in another base with more characters than our standard 26-letter English alphabet? In other words, instead of requiring 4 "characters" for the word "four", could I create a representation or hash using only, say, 2 (e.g. "6$")?
Comments (2)
I believe the point of Base64 is that you can easily convert any binary data into "human readable" letters and numbers. That makes it easy to post arbitrary data to newsgroups or transmit it over text-based protocols.
If you want to further "compress" this data, you need to decide how many distinct characters you want to allow. There are only so many combinations of 8 bits (256). The most efficient option would be to use all of them, in which case why not just use gzip?
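To make the trade-off concrete, here is a minimal sketch in Python (my choice of language, not something from the question) that compares the size of some UTF-8 text with its Base64 form and its gzip-compressed form. The sample string and the variable names are purely illustrative.

```python
import base64
import gzip

# Illustrative sample: a sentence repeated a few times so gzip has something to find.
text = ("is there a way to encode UTF-8 text to another base "
        "with more characters than the 26 letter alphabet? ") * 4
raw = text.encode("utf-8")

b64 = base64.b64encode(raw)                  # 4 output characters per 3 input bytes
packed = gzip.compress(raw, compresslevel=9)

print("raw bytes:   ", len(raw))
print("base64 chars:", len(b64))
print("gzip bytes:  ", len(packed))

# A very short input such as "four" does not shrink: the gzip container
# alone adds a fixed header and trailer that outweigh the 4 bytes of data.
tiny = gzip.compress(b"four")
print("gzip of b'four':", len(tiny), "bytes")
```

Base64 always makes the data longer in characters, because each character carries only 6 bits. The gzip output is shorter here only because the sample is long and repetitive; it is arbitrary bytes, so it is no longer printable text unless you encode it again.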
Your question seems related to order-0 entropy coding:
http://en.wikipedia.org/wiki/Entropy_encoding
The most famous algorithm in this family is Huffman coding:
http://en.wikipedia.org/wiki/Huffman_coding
Huffman coding will not only tell you that only 64 characters are used, and that 6 bits per character are therefore enough: it will also distinguish between frequent characters, such as (space), and rare ones, such as (;). It then creates a code in which frequent characters use fewer bits than rare ones, resulting in better compression (typically around 4.5 bits per character on English text).
Huffman coding is an all-around compression technique, used as part of many compression algorithms, including zip.
You can find a demo program that applies only a single pass of Huffman compression (Huff0) here; it will help you determine how much can be gained by using this technique on your sample inputs:
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
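If you would rather estimate the gain without downloading anything, the idea above can be sketched in a few lines of Python. This is not Huff0 and not how zip implements it; it is a plain textbook Huffman construction that computes only the code lengths. The function names (`huffman_code_lengths`, `average_bits`, `order0_entropy`) and the sample sentence are my own, chosen for illustration.

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate case: a single distinct symbol
        return {sym: 1 for sym in freq}
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)      # the two lightest subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in a.items()}
        merged.update({s: d + 1 for s, d in b.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def average_bits(text):
    """Average Huffman code length per character of text."""
    freq = Counter(text)
    lengths = huffman_code_lengths(text)
    return sum(freq[s] * lengths[s] for s in freq) / len(text)

def order0_entropy(text):
    """Order-0 entropy in bits per character (lower bound for any order-0 coder)."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

sample = ("huffman coding gives frequent characters such as space "
          "shorter codes than rare ones")
print(f"huffman: {average_bits(sample):.2f} bits/char")
print(f"entropy: {order0_entropy(sample):.2f} bits/char")
```

The Huffman average always lands between the order-0 entropy and the entropy plus one bit. Note that a real encoder also has to store the code table alongside the data, which is why this pays off on longer inputs and not on a single word like "four".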