Compressing UTF-8 (or another 8-bit encoding) into 7 bits or fewer
I wish to take a file encoded in UTF-8 that doesn't use more than 128 different characters, then move it to a 7-bit encoding to save 1/8 of the space. For example, if I have a 16 MB text file that only uses the first 128 (ASCII) characters, I would like to shave off the extra bit to reduce the file to 14 MB.
How would I go about doing this?
There doesn't seem to be an existing free or proprietary program to do so, so I was thinking I might try to make a simple (if inefficient) one.
The basic idea I have is to make a function from the current hex/decimal/binary values used for each character to the 128 values I would have in the seven bit encoding, then scan through the file and write each modified value to a new file.
So if the file looked like this (I'll use a decimal example because I try not to have to think in hex):
127 254 025 212 015 015 132...
it would become
001 002 003 004 005 005 006
if 127 mapped to 001, 254 mapped to 002, etc.
I'm not entirely sure about a couple of things, though.
- Would this be enough to actually shorten the file size? I have a bad feeling this would simply leave an extra 0 on the binary string: 11011001 might get mapped to 01000001 rather than 1000001, and I wouldn't actually save space. If this would happen, how do I get rid of the zero?
- How do I open the file to read/write in binary/decimal/hex rather than just text?
I've mostly worked with Python, but I can muddle through C if I must.
Thank you.
Comments (6)
Just use gzip compression, and save 60-70% with 0% effort!
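As a rough illustration of that suggestion, here is a minimal sketch using Python's standard `gzip` module. The sample text is made up and highly repetitive, so the ratio it prints says nothing about what a real 16 MB file would achieve; the 60-70% figure is the commenter's estimate, not something this snippet verifies.

```python
import gzip

# Made-up, highly repetitive ASCII sample; real text will compress less well.
sample = ("The quick brown fox jumps over the lazy dog. " * 2000).encode("ascii")

packed = gzip.compress(sample)

# Compression is lossless: decompressing gives back the original bytes.
restored = gzip.decompress(packed)

print(len(sample), "bytes before,", len(packed), "bytes after gzip")
print("7-bit packing would give at best", len(sample) * 7 // 8, "bytes")
```

For a file on disk, `gzip.open(path, "wb")` can be used in place of the built-in `open` to write compressed data directly.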
Do you understand that files are divided into bytes? Thus, if you did that, you'd have 7 bits of the first letter in byte 1, plus 1 bit of the second letter; then in byte 2, you'd have 6 bits of the second letter and 2 bits of the third, and so on. It would look like this:
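A small sketch can print that layout. It assumes plain ASCII input and pads the final byte with zero bits; `bit_layout` is a hypothetical helper name used only for this illustration.

```python
def bit_layout(text):
    """Return the bytes (as bit strings) produced by packing 7-bit codes back to back."""
    if any(ord(ch) > 127 for ch in text):
        raise ValueError("only 7-bit (ASCII) characters can be packed this way")
    # Concatenate the 7-bit code of every character into one long bit string.
    bits = "".join(format(ord(ch), "07b") for ch in text)
    # Pad with zero bits so the total length is a whole number of bytes.
    bits += "0" * (-len(bits) % 8)
    return [bits[i:i + 8] for i in range(0, len(bits), 8)]


# 'A' is 1000001 and 'B' is 1000010: byte 1 holds all of 'A' plus the
# first bit of 'B'; byte 2 holds the remaining 6 bits of 'B' plus padding.
for number, byte in enumerate(bit_layout("AB"), start=1):
    print("byte", number, byte)
```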
Your idea is on the right track, but needs some development. If you're interested in this kind of data compression, you may want to investigate Huffman coding. This is a simple data compression technique that is used in many real-world situations.
I can recommend The Data Compression Book by Mark Nelson, which is a great introduction to data compression techniques.
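To give a feel for Huffman coding, here is a minimal sketch that builds a code table from character frequencies. It is not taken from the book mentioned above; `huffman_codes` is a hypothetical helper name, and the sketch only computes the codes without writing packed bits to a file.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Map each character of text to a prefix-free bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tiebreaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


text = "aaaabbc"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(codes, total_bits, "bits versus", 7 * len(text), "bits at a fixed 7 bits each")
```

Frequent characters get shorter codes, which is why this can beat a fixed 7 bits per character when the character distribution is skewed.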
Your idea is unlikely to work as stated. If you write the byte 0x05 into a file, the whole byte gets written, all 8 bits of it, leading zeros included. To actually accomplish what you need, you can encode every 8 values in 7 bytes (since you only need 8*7 bits to encode 8 seven-bit values). One approach is to keep 7 of the values in the 7 low bits of their bytes, and spread the bits of the 8th value over the 7 MSBits.
As for Python, opening a file in binary write mode is open(filename, 'wb'). You'll also have to learn about bit operations to pack bytes as described above. Just a small example: an expression along the lines of c = ((a & 1) << 7) | b places the lowest bit of a into the MSBit of c, and the rest of c is the value of b. I'm sure you can take it from here.
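The 8-into-7 packing described above could be sketched roughly as follows. The function names `pack8` and `unpack8` are hypothetical, the choice of which bit of the 8th value goes into which byte is an assumption, and the sketch ignores the question of what to do when the input length is not a multiple of 8.

```python
def pack8(values):
    """Pack eight 7-bit values into seven bytes."""
    if len(values) != 8 or any(not 0 <= v < 128 for v in values):
        raise ValueError("expected exactly eight values in the range 0..127")
    extra = values[7]
    out = bytearray()
    for i in range(7):
        # Bit i of the 8th value goes into the MSBit of byte i;
        # the low 7 bits of byte i hold value i unchanged.
        bit = (extra >> i) & 1
        out.append((bit << 7) | values[i])
    return bytes(out)


def unpack8(block):
    """Reverse pack8: recover eight 7-bit values from seven bytes."""
    if len(block) != 7:
        raise ValueError("expected exactly seven bytes")
    values = [b & 0x7F for b in block]
    extra = 0
    for i, b in enumerate(block):
        # Collect the MSBits back into the 8th value.
        extra |= (b >> 7) << i
    values.append(extra)
    return values


original = [ord(ch) for ch in "ABCDEFGH"]
packed = pack8(original)
print(len(original), "values ->", len(packed), "bytes")
```

To apply this to a file, the input would be read with `open(path, 'rb')`, processed eight bytes at a time, and the result written with `open(path, 'wb')`.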
"this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space."
Correct. Your plan will do nothing.
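A quick way to see this is to apply the remapping from the question and compare lengths. This sketch assigns new values in order of first appearance, which is one reading of the question's example rather than a stated rule.

```python
# The decimal byte values from the question's example.
data = bytes([127, 254, 25, 212, 15, 15, 132])

# Assign each distinct value a small number in order of first appearance.
table = {}
for value in data:
    table.setdefault(value, len(table) + 1)

remapped = bytes(table[value] for value in data)

print(list(remapped))             # every value now fits in 7 bits...
print(len(data), len(remapped))   # ...but each one still occupies a whole byte
```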
What you need is UTF-7.
Edit: UTF-7 has the advantage of bloating "only" special characters, so if special characters are rare in the input, you get far fewer bytes than by just converting UTF-8 to 7-bit. That's what UTF-7 is for.
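For reference, Python ships a `utf-7` codec, so the effect is easy to try. The sample string below is made up. Note that UTF-7 only guarantees that every output byte has its high bit clear; each of those bytes still takes up 8 bits on disk unless it is bit-packed separately.

```python
text = "naïve café"  # made-up sample with two non-ASCII characters

utf8 = text.encode("utf-8")
utf7 = text.encode("utf-7")

# Every UTF-7 byte is below 0x80, i.e. it fits in 7 bits.
high_bit_clear = all(byte < 0x80 for byte in utf7)

print(utf7)
print(len(utf8), "bytes as UTF-8,", len(utf7), "bytes as UTF-7")
print("all bytes 7-bit clean:", high_bit_clear)
```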