我在如何将霍夫曼编码字符串转换为二进制 python 时遇到问题。
这道题与霍夫曼算法无关。
它是这样的:
我可以获得一个编码的霍夫曼字符串,比如01010101010
。 注意,它是一个字符串。
但现在我想将字符串表示形式保存到真正的二进制文件中。
在哈夫曼编码的字符串中,每个0和1都是一个字节。
我想要的是每一个0和1都是一个位。
我怎样才能在Python中做到这一点?
编辑1:
请原谅我没有足够清楚地描述我的问题。
让我解释一下我目前将 0 和 1 写入二进制的方法。
比如说,我们可以得到一个代码字符串 s='010101010'。
- 我使用
int
将其转换为整数
- 然后使用
unichr
将其转换为字符串,以便我可以将其写入文件
- 以二进制模式将字符串写入文件
还要注意,我需要读取该文件才能解码霍夫曼代码。
所以我的方法是,
- 从文件中读取字节,
- 将它们恢复为 int,
- 将 int 转换为其二进制表示字符串。
- 解码字符串
在第 2 步,问题发生了,我变得毫无头绪。
因为有些哈夫曼字符串可能很短(例如10
),而有些则可能很长(010101010101001
)。这导致它们的 int 值的字节长度不同(
一些短字符串可能只需要一个字节,而长字符串可能需要两个甚至更多
)
以下代码说明了我的问题:
ss=['010101','10010101010']
# first one is short and takes only one byte in its int value
# second one is long and takes two bytes
print 'write it to file'
with open('binary.bin','wb') as f:
for s in ss:
n=int(s,2)
print n
s=unichr(n)
f.write(s)
print 'read it to file'
with open('binary.bin','rb') as f:
for s in f.read():
print ord(s)
我在第二个 with 部分中一次读取一个字节,但这实际上是不正确的。因为字符串10010101010
占用两个字节。
那么,当我从文件中读取这些字节时,我应该一次读取多少字节?
I am having problem with how to convert huffman encoding string to binary python.
This question involves nothing of the huffman algorithm.
It is like this:
I can get an encoded huffman string, say 01010101010
. Note, it is a string.
But now I want to save the string representation into real binary.
In the huffman encoded string, every 0 and 1 is a byte.
What I want is every 0 and 1 is a bit.
How can I do that in python?
Edit 1:
Please forgive I did not describe my problem clear enough.
Let me explain my current approach of writing to zeros and ones to binary.
Say, we can a code string s='010101010'.
- I use
int
to convert it to integer
- Then use
unichr
to convert it to string so that I can write it to file
- write the string to file in binary mode
Also to be noted, I need to read the file in order to decode the huffman code.
So my approach is,
- read the bytes from file
- restore them to int
- convert the int to their binary representation string.
- decode the string
And at step 2, the problem happens and I became clueless.
As some huffman string can be short(like, 10
), while some can be long(010101010101001
). This results in their different byte length in their int value(
some short string may take just one byte,while long ones can take two or even more
)
The following code illustrates my problem:
ss=['010101','10010101010']
# first one is short and takes only one byte in its int value
# second one is long and takes two bytes
print 'write it to file'
with open('binary.bin','wb') as f:
for s in ss:
n=int(s,2)
print n
s=unichr(n)
f.write(s)
print 'read it to file'
with open('binary.bin','rb') as f:
for s in f.read():
print ord(s)
I am reading one byte a time in the second with part, but this is actually not correct. Because string 10010101010
takes up two bytes.
So, when I read those bytes from the file, How many bytes should I read at once?
发布评论
评论(3)
您可能想要使用 Python 中两种不同的“二进制”表示形式。
Bignum
One 是“bignum”或任意精度整数。这种类型在 Python 2.x 中称为
long
,在 Python 3.x 中称为int
。顾名思义,这种表示形式在语义上是任意长度的整数,因此如果您计划对结果数字进行算术运算,它会很有用。要解析二进制数字字符串,请使用。
bitstring
库。或者,正如 Marc B 在评论中建议的那样,使用
位串
库。具体来说,要进行转换,请使用bitstring.pack
函数< /a>.对于霍夫曼编码,使用
位串
可能比将数据存储在字节
字符串中更好,因为霍夫曼编码通常不是 8 位的倍数;bitstring
允许您操作任意长度的位串。缺点:bitstring
不是标准库的一部分。There are two different "binary" representations in Python that you might want to use.
Bignum
One is a "bignum" or arbitrary-precision integer. This type is called
long
in Python 2.x andint
in Python 3.x. As the name suggests, this representation is semantically an integer of arbitrary length, so it's useful if you plan to do arithmetic on the resulting number. To parse a string of binary digits, useor
bitstring
libraryAlternatively, as Marc B suggested in the comments, use the
bitstring
library. Specifically, for conversion, use thebitstring.pack
function.For Huffman coding, using
bitstring
is probably preferable to storing data in abyte
-string, since Huffman codes are generally not multiples of 8 bits;bitstring
lets you manipulate bit-strings of arbitrary lengths. Disadvantage:bitstring
is not part of the standard library.一种可能的方法(使用bitstring库)有一定道理,但仍然包含不正确之处:
使用bitstring库(感谢Mechanical snail和 >Marc B )
用于写入文件。
步骤:
读取:
代码:
One possible approach(using bitstring library) which makes some sense, but still contain incorrectness:
Use bitstring library(Thanks to Mechanical snail and Marc B )
For writing to file.
Steps:
For reading:
Code:
您有一个需要转换为数字的字符串。
int
采用可选的“base”作为参数。因此,对于示例中的字符串,一旦有了数字(而不是字符串),想要“真正的”二进制就没有意义,因为数字是相同的,您可以以任何基数显示它。这意味着二进制
100
与十进制4
是相同的数字,在您的程序中它们不是不同的数字。因此,一旦将字符串转换为数字,您就可以修改其中的位。You have a string that you need to convert into a number.
int
takes an optional 'base' as an argument. So for the string in your example,Once you have a number (not a string), it doesn't make sense to want "real" binary, since the number is the same, you can display it in any base. That means the binary
100
is the same number as the decimal4
, inside your program they're not different numbers. So, once you turn your string into a number you can fiddle with the bits in it.