Sequence compression?
Recently I ran into a problem that puzzles me. The problem is: I want to compress a sequence in such a way that no information is lost, for example:
a,a,a,b --> a,b
a,b,a,a,c --> a,b,a,a,c (it cannot be compressed to a,b,a,c, because then we would lose the a,a)
Is there an algorithm that can do this kind of thing? What is this problem called? Is it compression, or something else? I would really appreciate any help. Thanks in advance.
Another good algorithm is Lempel–Ziv–Welch
I found this simple JavaScript LZW function, from the magicians at 140 bytes of javascript, marvellous.
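As a rough illustration of the same idea, here is a minimal LZW sketch in Python (my own rendering, not the 140bytes snippet; the function name lzw_compress is just illustrative):

    def lzw_compress(text):
        # Minimal LZW: grow a dictionary (hash table) of seen phrases and emit integer codes.
        dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
        w, codes = "", []
        for ch in text:
            wc = w + ch
            if wc in dictionary:
                w = wc
            else:
                codes.append(dictionary[w])
                dictionary[wc] = len(dictionary)  # the new phrase gets the next free code
                w = ch
        if w:
            codes.append(dictionary[w])
        return codes

    print(lzw_compress("abababababab"))  # repeated "ab" phrases collapse into a few codes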
Every algorithm which is able to transform data in such a way that it takes up less memory is called compression. It may be lossless or lossy.
e.g. (compressed form for "example given" :-) )
The following is imho the simplest form, called run length encoding, in short RLE:
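For instance, a minimal run-length encoder could be sketched like this (my own illustration, assuming a simple character-plus-count output format):

    from itertools import groupby

    def rle(s):
        # Collapse each run of identical characters into the character followed by the run length.
        return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(s))

    print(rle("aaab"))   # a3b1
    print(rle("abaac"))  # a1b1a2c1 (runs of length 1 actually make the output longer)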
As you can see, all consecutive identical characters are compressed into one.
You can also search for recurring patterns, which is much more difficult:
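As a toy illustration (it assumes the repeated pattern is already known, which real compressors have to discover themselves):

    def pattern_rle(s, pattern):
        # Collapse consecutive repeats of one known pattern into "pattern*count".
        out, i, n = [], 0, len(pattern)
        while i < len(s):
            count = 0
            while s[i:i + n] == pattern:
                count += 1
                i += n
            if count:
                out.append(f"{pattern}*{count}")
            else:
                out.append(s[i])
                i += 1
        return " ".join(out)

    print(pattern_rle("abcabcabcxy", "abc"))  # abc*3 x y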
There is a lot of literature and there are many web sources about compression algorithms; you should use them to get a deeper view.
Yep, compression. A simple algorithm would be run-length encoding. There is also information theory, which is the basis for compression algorithms.
Information theory: more common inputs should get shorter codes, thus making the overall message shorter.
So, if you're encoding binary, where the sequence 0101 is very common (about 25% of the input), then a simple compression would be: encode the nibble 0101 as a single 0, and any other nibble as a 1 followed by its four bits.
So the input:
0101 1100 0101 0101 1010 0101 1111 0101
Would be compressed to:
0 11100 0 0 11010 0 11111 0
That's a compression from 32 bits to 20 bits.
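A sketch of that toy scheme in Python (my own rendering: the common nibble 0101 is encoded as a single 0, any other nibble as a 1 followed by its four bits):

    def encode(bits):
        # Prefix code: 0101 -> "0", any other nibble -> "1" + the nibble itself.
        out = []
        for i in range(0, len(bits), 4):
            nibble = bits[i:i + 4]
            out.append("0" if nibble == "0101" else "1" + nibble)
        return " ".join(out)

    print(encode("01011100010101011010010111110101"))
    # 0 11100 0 0 11010 0 11111 0  -> 20 bits instead of 32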
An important lesson: the choice of compression algorithm is entirely dependent on the input. Choose the wrong algorithm and you will likely make the data longer.
Unless you have to code some solution yourself, you could use some ZIP compression library for the programming language you're using.
And yes, this is data compression.
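For example, in Python the standard-library zlib module (DEFLATE, the same compression family used in ZIP files) is enough for this kind of task:

    import zlib

    data = b"a,a,a,b," * 100
    compressed = zlib.compress(data)
    print(len(data), "->", len(compressed))      # highly repetitive input shrinks a lot
    assert zlib.decompress(compressed) == data   # lossless: the original comes back exactly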
We can use the LZW compression algorithm to compress text files efficiently and quickly by making use of hash tables.
Compression limit of an individual sequence
What is the compression limit of a sequence?
Two answers can be given to this question:
The compression limit of a sequence is an unsolved problem. Furthermore, it has been shown that this limit cannot be calculated.
Shannon defined a subproblem of great practical relevance, in which it is possible to define a theoretical limit (source coding theorem). This subproblem is purely probabilistic and is called source coding.
Definition of the information transfer problem in its most general form
To understand these two answers it is essential to introduce the problem of information transmission in its most general form. For the transmission of information to take place between two points, the two subjects must share a communication language. This requirement is necessary: without a communication language, the transmitted message cannot be understood and therefore does not carry any information. Consequently, the information transmission problem, in its most general form, studies the transfer of information in any form between two points that share a communication language.
The problem with this definition is that the concept of "information in any form" is difficult to formalize mathematically. For this reason, it is essential to introduce a first simplification of the problem, in which the information is defined as a numerical sequence. In this way, a subproblem is obtained, in which the concept of information is well-defined and can be formalized from the mathematical point of view.
Consequently, the two elements necessary for the transfer of information are: a compressed sequence which carries the information and a communication language known by both the encoder and the decoder.
Compressed sequence: represents any information derived from the sequence to be transmitted. From a practical point of view, any information obtained from the sequence that is useful for the decoder to go back to the original message must be considered part of the compressed message.
Communication language: represents a set of rules that allows communication and therefore the transfer of information between two subjects. From a practical point of view, any information present before the generation of the message can be considered part of the communication language.
Source coding
Shannon, in his famous article "A Mathematical Theory of Communication", introduces a further modification of the information transfer problem, which allows him to obtain a subproblem in which it is possible to define the theoretical compression limit. For this purpose, Shannon introduces the source that generates the message. Furthermore, the coding scheme which converts the symbols into codewords is generated from the information about the source and not from the information obtained from the message. Therefore, since the coding scheme is defined before the generation of the message, it can be considered an integral part of the communication language. Shannon then formalizes the concept of information: for this purpose, he develops a variable that uniquely defines the message transmitted in the presence of an encoding scheme. The solution developed by Shannon is to consider information as a measure of variation. The function that performs this measurement is called entropy H. The following sentence, taken from "A Mathematical Theory of Communication", conveys Shannon's idea of entropy.
“In the limiting case where one probability is unity (certainty) and all the others zero (impossibility), then H is zero (no uncertainty at all - no freedom of choice - no information).”
With regard to the subproblem he defined, Shannon manages to define a theoretical compression limit by developing the source coding theorem. This theorem proves that entropy defines the minimum average length of the codewords that replace the symbols in the compressed message. Thus, given a source X of i.i.d. random variables which generates a message of length N, the compressed message cannot be longer than NH(X). Since the encoding scheme is defined before message generation, it is not part of the compressed message.
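For reference, the standard definition the answer relies on (not written out explicitly above): for a source whose symbols appear with probabilities p_i, the entropy is

    H(X) = -\sum_i p_i \log p_i

and the source coding theorem states that the average length of an optimally encoded message of N source symbols approaches NH(X).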
Application of the theory developed by Shannon when the source is not known
When the source is not known, it is possible to calculate the zero order empirical entropy H0(m), in which the frequencies are obtained from the message m to be transmitted. In this case, the coding scheme is developed after the message has been generated, using the frequencies of the symbols in the sequence; therefore, it must be part of the compressed message. Consequently, the compressed message is defined by NH0(m)+coding scheme. Since the value of H0(m) is very close to the value of H(X), the coding scheme represents a redundancy, which is also called the inefficiency of the entropy coding. In practice, not knowing the source that generates the message determines the presence of a redundancy in the compressed message.
Furthermore, the length of the coding scheme depends on the language used; therefore, it is not possible to define it precisely. Consequently, it is not possible to define a limit on the length of the compressed message. Therefore, using the formalism introduced by Shannon, we have shown that the compression limit of a sequence whose source is unknown is not computable.
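A minimal sketch of how the zero order empirical entropy H0(m) of a message can be computed (the function name and the default base-2 logarithm are my assumptions; the answer does not fix a base at this point):

    from collections import Counter
    from math import log

    def h0(message, base=2):
        # Zero order empirical entropy: Shannon's formula applied to the
        # symbol frequencies observed in the message itself.
        n = len(message)
        return -sum(c / n * log(c / n, base) for c in Counter(message).values())

    m = "aabac"
    print(len(m) * h0(m))  # NH0(m); the coding scheme still has to be added on top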
Difference between the source entropy H(X) and the zero order empirical entropy H0(m)
The value of NH(X) represents the compressed message. So entropy, in this case, defines message information in a more general sense, not just Shannon information. Unfortunately, this method does not represent the general case of information transmission, but represents a subproblem in which both the encoder and the decoder know the source that generates the message.
The value of NH0(m) does not represent the compressed message. Therefore, the zero-order empirical entropy represents only the average value of the Shannon information of the single symbol. Consequently, the zero-order empirical entropy has a less general meaning than the entropy of the source. However, in this case, the problem of information transmission is being addressed in a much more general way than in the case of source coding.
To demonstrate what has just been said, the following example is given: let us take a uniform source X of i.i.d. random variables which generates a message of length N. Of course, this sequence cannot be compressed. Thus, on average the compressed message must have a length greater than or equal to N. In the first case, the encoder and the decoder know the source and therefore we can apply source coding. The result obtained is that all the messages have length NH(X), and since H(X)=1 (we use the size of the alphabet as the base of the entropy) the compressed message has length N. So, in this case, we reach the theoretical compression limit. In the second case, neither the encoder nor the decoder knows the source. Consequently, the encoder must use the frequencies of the symbols in the message to calculate the entropy. In this way, the zero order empirical entropy H0(m) is obtained. The source is uniform, but only a small fraction of the generated messages have a uniform symbol distribution. Therefore, the average value of the zero-order empirical entropy H0(m) of the messages will be less than H(X). Consequently, if we do not consider the coding scheme as part of the compressed message, we obtain an illogical result: in fact, we have NH0(m)<N. In practice, we would have managed to compress a random sequence. The error depends on the fact that, in this second case, we obtained the coding scheme after the message was generated; therefore, it must be part of the compressed message. In conclusion, to obtain a correct result, the compressed string must be defined by NH0(m)+coding scheme, whose length is greater than N. Therefore, we have a redundancy due to not knowing the source.
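This effect is easy to reproduce numerically (a sketch under the stated assumptions: uniform i.i.d. binary source, entropy taken in base |A| = 2):

    import random
    from collections import Counter
    from math import log2

    def h0(message):
        n = len(message)
        return -sum(c / n * log2(c / n) for c in Counter(message).values())

    random.seed(0)
    N, trials = 16, 10_000
    avg = sum(h0(random.choices("01", k=N)) for _ in range(trials)) / trials
    print(avg)  # roughly 0.95 < H(X) = 1, so NH0(m) < N on average for random data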
Modern approach to information theory
One of the most important aspects covered is to understand that source coding is a subproblem of the problem of information transmission in its most general form. Consequently, when the source is not known, only the zero order empirical entropy can be calculated, a parameter that has a much more limited value than the entropy of the source. Indeed, the value obtained by multiplying the zero-order empirical entropy by the message length, NH0(m), does not represent the compressed message. In this case, the coding scheme must also be part of the compressed message, since it is defined after the generation of the message and therefore cannot be included in the communication language. This approach represents an important change of point of view. In fact, it used to be considered absolutely normal, even when the source was not known, to treat only NH0(m) as the compressed message, without counting the coding scheme, as is done in source coding.
This change of point of view was necessary with the arrival of a new theory that makes it possible to reduce the redundancy that is generated when the source is not known. This theory can be applied only when the alphabet size is greater than or equal to 3. The purpose of this theory is to make the encoded sequence NH0(m) and the coding scheme interact with each other more efficiently.
Now, I will explain the basis of this theory. If we have to perform entropy coding on a sequence of length N and alphabet A, whose source we do not know, the only information we have is that the sequence will be one of the |A|^N possible sequences. Thus, the only possible way for a transform to improve, on average, the length of the compressed sequence (encoded sequence + coding scheme) under an entropy coding is to transform the set of all possible sequences into a new set of the same size, composed of sequences that on average can be compressed into less space. In this way, even if we do not know the source that generates the sequence, we know that if we apply this transform we can, on average, obtain a compressed message shorter than the compressed message of the untransformed sequence.
The set having these characteristics is the set of size |A|^N composed of sequences of length N+K. Increasing the length of the sequences also increases the size of the set that includes all possible sequences, which becomes |A|^(N+K). Therefore, from this set it is possible to select a subset of size |A|^N composed of the sequences having the smallest value of the zero order empirical entropy.
In this way, we obtain the following result: given a message m of length N generated by a source X (unknown) of random variables i.i.d., applying the described transform f(m), we obtain on average:
NH(X) < Avg(NtH0(f(m)) + coding scheme) < Avg(NH0(m) + coding scheme)
with Nt > N
Avg(NH0(m) + coding scheme) = mean value of (encoded sequence m + coding scheme)
Avg(NtH0(f(m)) + coding scheme) = mean value of (encoded transformed sequence f(m) + coding scheme)
NH(X) is the compression limit, which is not known; in fact, we have set the condition that the source is not known. As mentioned at the beginning, when the source is not known, it is not possible, in the current state of knowledge, to define a compression limit.
Now, we verify the obtained result experimentally. Let's take sequences of length 3 over an alphabet of 3 symbols: there are 27 possible sequences. If we increase the length to 4 and keep the alphabet at 3, the possible sequences become 81. Of these 81, we select the 27 with the smallest value of H0(m). In this way, the two sets have the same number of elements and we can define a one-to-one correspondence between the sequences of the two sets. The following table shows the 27 elements of the two sets. In the first column we have the message m, in the second column NH0(m), in the third column the transformed message f(m), and in the fourth column NtH0(f(m)) of the transformed message f(m).
[Table: the 27 messages m with NH0(m), and the corresponding transformed messages f(m) with NtH0(f(m))]
The average value of zero order empirical entropy multiplied by the string length N=3 of the messages m is:
Avg(NH0(m))=2.893
with N=3.
The average value of zero order empirical entropy multiplied by the length of the string Nt=4 of the transformed messages f(m) is:
Avg(NtH0(f(m)))=2.884
with Nt=4.
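The two averages can be reproduced with a short script (a sketch; it assumes base-2 entropy, which is what the reported values 2.893 and 2.884 correspond to, and takes as transformed set the 27 length-4 sequences with the lowest entropy):

    from collections import Counter
    from itertools import product
    from math import log2

    def nh0(seq):
        # Sequence length times its zero order empirical entropy (base 2).
        n = len(seq)
        return -sum(c * log2(c / n) for c in Counter(seq).values())

    alphabet = "123"
    originals = ["".join(p) for p in product(alphabet, repeat=3)]   # the 27 messages m
    candidates = sorted(("".join(p) for p in product(alphabet, repeat=4)), key=nh0)
    transformed = candidates[:27]                                   # the 27 lowest-entropy f(m)

    print(sum(map(nh0, originals)) / len(originals))      # ~2.893
    print(sum(map(nh0, transformed)) / len(transformed))  # ~2.884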
Thus, you can see that although the transformed messages are longer, the average value of zero-order empirical entropy multiplied by the message length is, on average, smaller when the transform is applied. Consequently, even if we do not know the source, we know that by applying the transform, we obtain, on average, a reduction in the value of zero-order empirical entropy multiplied by the length of the message. Therefore, the length of the encoded message (symbols replaced by codewords) decreases.
Now, let's look at the encoding scheme, in this case making an evaluation is more difficult because the length of the encoding scheme depends on the compression method used. But, we know that, from a theoretical point of view, the parameter that most influences the length of the encoding scheme is the size of the alphabet. Therefore, we calculate the average size of the alphabet A in the two sets.
The mean value of the alphabet size |A|, in the case of messages in the first column of the table, is:
Avg(|A|)=2.1
The mean value of the alphabet size |A|, in the case of transformed messages f(m) in the third column of the table, is:
Avg(|A|)=1.89
The result obtained is not valid only for the reported case. Indeed, whatever the size of the alphabet, the transformed set, being composed of longer sequences, always has a greater number of type classes with a number of distinct symbols less than |A|. By type class we mean a set of strings that all have the same symbol frequencies; for example, string 113 and string 131 belong to the same type class, since both have symbol frequencies 1 = 2/3 and 3 = 1/3.
Therefore, since on average NtH0(f(m)) < NH0(m) and the coding scheme of f(m) is shorter than the coding scheme of m, we have experimentally demonstrated the following inequality:
NH(X) < Avg(NtH0(f(m)) + coding scheme) < Avg(NH0(m) + coding scheme)
In this way, the set of transformed messages has an average compressed message length, using entropy encoding, smaller than that of the set of untransformed messages. Consequently, when the source is not known, if the described transform is applied, a shorter compressed message is obtained on average. Therefore, we are able to reduce the inefficiency caused by not knowing the source.
In conclusion, the question we posed at the beginning, about the existence of a minimum compression length of a sequence, represents a problem that has not yet been solved. However, new developments regarding the theory of information, managing to reduce the inefficiency that is created when we don’t know the source, have made it possible to take a significant step forward on this topic.