将哈夫曼码字符串转换为二进制

发布于 2024-12-01 17:54:10 字数 1500 浏览 1 评论 0 原文

我在如何将霍夫曼编码字符串转换为二进制 python 时遇到问题。

这道题与霍夫曼算法无关。

它是这样的:

我可以获得一个编码的霍夫曼字符串,比如01010101010注意,它是一个字符串。

但现在我想将字符串表示形式保存到真正的二进制文件中。

在哈夫曼编码的字符串中,每个0和1都是一个字节

我想要的是每一个0和1都是一个

我怎样才能在Python中做到这一点?

编辑1:

请原谅我没有足够清楚地描述我的问题。

让我解释一下我目前将 0 和 1 写入二进制的方法。

比如说,我们可以得到一个代码字符串 s='010101010'。

  1. 我使用int将其转换为整数
  2. 然后使用unichr将其转换为字符串,以便我可以将其写入文件
  3. 以二进制模式将字符串写入文件

还要注意,我需要读取该文件才能解码霍夫曼代码。

所以我的方法是,

  1. 从文件中读取字节,
  2. 将它们恢复为 int,
  3. 将 int 转换为其二进制表示字符串。
  4. 解码字符串

第 2 步,问题发生了,我变得毫无头绪。

因为有些哈夫曼字符串可能很短(例如10),而有些则可能很长(010101010101001)。这导致它们的 int 值的字节长度不同( 一些短字符串可能只需要一个字节,而长字符串可能需要两个甚至更多 )

以下代码说明了我的问题:

ss=['010101','10010101010'] 
# first one is short and takes only one byte in its int value
# second one is long and takes two bytes   

print 'write it to file'
with open('binary.bin','wb') as f:
    for s in ss:
        n=int(s,2)
        print n
        s=unichr(n)
        f.write(s)

print 'read it to file'
with open('binary.bin','rb') as f:
    for s in f.read(): 
        print ord(s)

我在第二个 with 部分中一次读取一个字节,但这实际上是不正确的。因为字符串10010101010占用两个字节。

那么,当我从文件中读取这些字节时,我应该一次读取多少字节?

I am having problem with how to convert huffman encoding string to binary python.

This question involves nothing of the huffman algorithm.

It is like this:

I can get an encoded huffman string, say 01010101010. Note, it is a string.

But now I want to save the string representation into real binary.

In the huffman encoded string, every 0 and 1 is a byte.

What I want is every 0 and 1 is a bit.

How can I do that in python?

Edit 1:

Please forgive I did not describe my problem clear enough.

Let me explain my current approach of writing to zeros and ones to binary.

Say, we can a code string s='010101010'.

  1. I use int to convert it to integer
  2. Then use unichr to convert it to string so that I can write it to file
  3. write the string to file in binary mode

Also to be noted, I need to read the file in order to decode the huffman code.

So my approach is,

  1. read the bytes from file
  2. restore them to int
  3. convert the int to their binary representation string.
  4. decode the string

And at step 2, the problem happens and I became clueless.

As some huffman string can be short(like, 10), while some can be long(010101010101001). This results in their different byte length in their int value(
some short string may take just one byte,while long ones can take two or even more
)

The following code illustrates my problem:

ss=['010101','10010101010'] 
# first one is short and takes only one byte in its int value
# second one is long and takes two bytes   

print 'write it to file'
with open('binary.bin','wb') as f:
    for s in ss:
        n=int(s,2)
        print n
        s=unichr(n)
        f.write(s)

print 'read it to file'
with open('binary.bin','rb') as f:
    for s in f.read(): 
        print ord(s)

I am reading one byte a time in the second with part, but this is actually not correct. Because string 10010101010 takes up two bytes.

So, when I read those bytes from the file, How many bytes should I read at once?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

ペ泪落弦音 2024-12-08 17:54:10

您可能想要使用 Python 中两种不同的“二进制”表示形式。

Bignum

One 是“bignum”或任意精度整数。这种类型在 Python 2.x 中称为 long,在 Python 3.x 中称为 int。顾名思义,这种表示形式在语义上是任意长度的整数,因此如果您计划对结果数字进行算术运算,它会很有用。要解析二进制数字字符串,请使用

# Python 2
long(digit_str, 2)

# Python 3
int(digit_str, 2)

bitstring 库。

或者,正如 Marc B 在评论中建议的那样,使用 位串。具体来说,要进行转换,请使用 bitstring.pack 函数< /a>.

对于霍夫曼编码,使用位串可能比将数据存储在字节字符串中更好,因为霍夫曼编码通常不是 8 位的倍数; bitstring 允许您操作任意长度的位串。缺点:bitstring 不是标准库的一部分。

There are two different "binary" representations in Python that you might want to use.

Bignum

One is a "bignum" or arbitrary-precision integer. This type is called long in Python 2.x and int in Python 3.x. As the name suggests, this representation is semantically an integer of arbitrary length, so it's useful if you plan to do arithmetic on the resulting number. To parse a string of binary digits, use

# Python 2
long(digit_str, 2)

or

# Python 3
int(digit_str, 2)

bitstring library

Alternatively, as Marc B suggested in the comments, use the bitstring library. Specifically, for conversion, use the bitstring.pack function.

For Huffman coding, using bitstring is probably preferable to storing data in a byte-string, since Huffman codes are generally not multiples of 8 bits; bitstring lets you manipulate bit-strings of arbitrary lengths. Disadvantage: bitstring is not part of the standard library.

故事↓在人 2024-12-08 17:54:10

一种可能的方法(使用bitstring库)有一定道理,但仍然包含不正确之处:

使用bitstring库(感谢Mechanical snail >Marc B )

用于写入文件。

步骤:

  1. 将纯文本编码为二进制表示字符串
  2. 连接所有这些字符串以形成一个更长的字符串
  3. 使用 bitstring.BitArray 转换为十六进制格式
  4. 将十六进制字符串写入文件

读取:

  1. 从以下位置读取十六进制字符串文件
  2. 使用 BitArray 将其转换回位字符串
  3. 开始解码

代码:

ss=['01010100','10010101010','010101110101010101'] #encoded message


from bitstring import BitArray,BitStream
print 'write it to file'
with open('binary.bin','wb') as f:
    s=''.join(ss);
    b=BitArray(bin=s)                 
    f.write(b.tobytes())# thanks to Scott, tobytes() method is very useful

print 'read it to file'
b=BitArray(filename='binary.bin')
print b.bin

One possible approach(using bitstring library) which makes some sense, but still contain incorrectness:

Use bitstring library(Thanks to Mechanical snail and Marc B )

For writing to file.

Steps:

  1. encode the plain text to binary representation string
  2. concatenate all those strings to form one longer one
  3. use bitstring.BitArray to convert to hex format
  4. write the hex string to a file

For reading:

  1. read the hex string from file
  2. convert it back to bit string using BitArray
  3. start decoding

Code:

ss=['01010100','10010101010','010101110101010101'] #encoded message


from bitstring import BitArray,BitStream
print 'write it to file'
with open('binary.bin','wb') as f:
    s=''.join(ss);
    b=BitArray(bin=s)                 
    f.write(b.tobytes())# thanks to Scott, tobytes() method is very useful

print 'read it to file'
b=BitArray(filename='binary.bin')
print b.bin
我要还你自由 2024-12-08 17:54:10

您有一个需要转换为数字的字符串。 int 采用可选的“base”作为参数。因此,对于示例中的字符串,

>>> int('01010101010', 2)
682

一旦有了数字(而不是字符串),想要“真正的”二进制就没有意义,因为数字是相同的,您可以以任何基数显示它。这意味着二进制 100 与十进制 4 是相同的数字,在您的程序中它们不是不同的数字。因此,一旦将字符串转换为数字,您就可以修改其中的位。

You have a string that you need to convert into a number. int takes an optional 'base' as an argument. So for the string in your example,

>>> int('01010101010', 2)
682

Once you have a number (not a string), it doesn't make sense to want "real" binary, since the number is the same, you can display it in any base. That means the binary 100 is the same number as the decimal 4, inside your program they're not different numbers. So, once you turn your string into a number you can fiddle with the bits in it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文