Measuring the efficiency of Huffman coding with Python bitstring
I have the following string that I would like to Huffman-encode and store efficiently into a bit array:
>>> print sequence
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|
The frequencies of the symbols in sequence are:
>>> print freqTuples
[(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]
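(The code that produced this list isn't shown; as a sketch, one way it could be computed is with collections.Counter. Variable names follow the question, and the ordering of equal-frequency symbols may differ.)
>>> from collections import Counter
>>> counts = Counter(sequence)
>>> freqTuples = sorted(((n / float(len(sequence)), sym) for sym, n in counts.items()), reverse=True)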
I translate this into a Huffman code dictionary:
>>> print codeDict
{'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
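(For completeness, a sketch of one way such a table can be built from freqTuples with heapq; this is not necessarily how the dictionary above was generated, and different tie-breaking may produce a different, equally valid prefix code.)
import heapq

def build_huffman_code(freq_tuples):
    # heap entries are (weight, tiebreak, tree); a tree is either a symbol
    # or a (left, right) pair of subtrees
    heap = [(w, i, sym) for i, (w, sym) in enumerate(freq_tuples)]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (left, right)))
        count += 1
    codes = {}
    def walk(tree, prefix=''):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            codes[tree] = prefix or '0'
    walk(heap[0][2])
    return codes

codeDict = build_huffman_code(freqTuples)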
I then used the Python bitstring package to translate the string, character by character, into an instance of the BitArray class, which I call bitArray and which contains the bits for each character encoded with its respective Huffman code:
>>> print bitArray.bin
0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111
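(The translation step itself isn't shown; assuming the per-character codes are simply concatenated, a minimal way to build it with bitstring would be:)
>>> from bitstring import BitArray
>>> bitArray = BitArray(bin=''.join(codeDict[c] for c in sequence))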
Here is the bit array in bytes:
>>> print bitArray.tobytes()
!I\254\363[^D\260^Z\240Ap
I must use tobytes() instead of bytes, as the bit array I generate does not divide evenly into 8-bit segments.
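(Concretely, from the output above: the binary string holds 84 bits, which is not a multiple of 8, so tobytes() pads the final byte and returns 11 bytes.)
>>> bitArray.len
84
>>> len(bitArray.tobytes())
11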
When I calculate the storage efficiency of the BitArray representation (the ratio of the sizes of the bit array and the input string), I get worse performance than if I had left the input string unencoded:
>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973
Am I measuring storage efficiency correctly? (If I encode longer input strings, this ratio improves, but it seems to approach an asymptotic limit of around 0.28. I'd like to confirm if this is the right way to measure things.)
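(For reference, 1.2972972973 × 37 ≈ 48, so sys.getsizeof is reporting roughly 48 bytes here, of which only 11 are the packed bits; the remainder is the fixed per-object overhead of a Python byte string.)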
Edit
The following two approaches yield different answers:
>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297
>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784
I'm not sure which to believe. But in the process of writing data to storage, I think I would need the byte representation, which makes me inclined towards choosing the first result.
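(The two figures differ only by the padding: the bit array holds 84 bits, which tobytes() pads out to 11 bytes = 88 bits, so the second ratio is 84 / (8 × 37) ≈ 0.2838 while the first is 88 / (8 × 37) = 11 / 37 ≈ 0.2973.)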
That figure of 1.297 implies that the encoded version is 30% longer than the original sequence.
I don't think you want to use getsizeof here -- if you want to minimize the size of the Python object, you should be using getsizeof(sequence) as well, rather than len. If instead you want to do what Huffman coding is meant to do and minimize the binary representation, then you want to use len on both (assuming the sequence is represented as one byte per character). So your real ratio is 11 / 37.
I assume you're using Huffman coding as an exercise, as this doesn't seem like a logical way to efficiently store what is just a four-bit code with a termination character. At least it would be better to use arithmetic coding, which will allow you to use base-5 encoding instead of base-2, which is optimal for 5 possible characters.
Really, I would assume that in a sequence long enough to be worth compressing there is a known G:A:C:T ratio, and/or that a fixed-length 2-bit encoding will be just as efficient (as the ratios approach 1:1:1:1), since you don't really need to encode the termination character.
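(As a rough illustration of the fixed-length point, a sketch rather than the answerer's code, reusing sequence and bitstring's BitArray from the question and ignoring the N and | characters for simplicity:)
>>> twoBit = {'A': '00', 'C': '01', 'G': '10', 'T': '11'}  # hypothetical fixed-length code
>>> bases = [c for c in sequence if c in twoBit]  # skip N and the terminator
>>> fixed = BitArray(bin=''.join(twoBit[c] for c in bases))
>>> fixed.len / (8. * len(bases))
0.25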
I'm not really sure about the bitarray stuff, but shouldn't you just be able to do:
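(Presumably a direct length comparison like the first calculation in the question's edit:)
>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297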
I'm not saying that will solve your problem, but it could be that the "getsizeof" thing (again, something I'm not really all that familiar with) is throwing you off.
From what you've written up there, it kind of looks like you're comparing apples to oranges a bit.
You know that the answer is wrong, because the Huffman dictionary is less than 4 bits per character, so the real answer must be less than 0.5. If the dictionary and character frequencies don't change for longer strings, then the compression ratio shouldn't decrease toward an asymptotic limit as the string gets longer.
From the documentation of sys, getsizeof reports the size of the Python object in bytes, including the object's own overhead, not just the data it holds. You need a function that will return the length of the bitstring itself, not the bitstring + overhead. The BitString documentation says that the len or length property returns the length in bits. So try doing:
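(Presumably the bit-level comparison already shown as the second calculation in the question's edit:)
>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784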