计算二进制字符串的 Lempel-Ziv (LZ) 复杂度(也称为序列复杂度)
我需要计算二进制字符串的 LZ 复杂度。 LZ 复杂度是从开始到结束查看流时遇到的差异子串的数量。例如:
s = 1001111011000010
在不同子串中标记序列复杂度 c(s) = 6: s = 1 / 0 / 01 / 1110 / 1100 / 0010 /
有人可以指导我找到一个简单的解决方案吗?我确信对于这个众所周知的问题应该有一些非常直接的实现,但我很难找到它们。可以简单地通过构建后缀树或类似的东西来完成吗?如果是,具体如何?我该怎么办?
有人知道任何 c/c++ 源代码来完成这项任务吗?
提前致谢。
澄清答案中建议的树的构造。这棵树看起来像这样吗?
o
/ \
o o
/ \ / \
o o o o
/ /
o o
I need to calculate the LZ-complexity of a binary string. The LZ-complexity is the number of differencet substrings encountered as the stream is viewed from begining to the end. As an example:
s = 1001111011000010
Marking in the different substrings the sequence complexity c(s) = 6:
s = 1 / 0 / 01 / 1110 / 1100 / 0010 /
can someone guide me to find a simple solution for that? I am sure there should be some very straight-forward implementations for this well-known problem, but I have difficulty finding them. Can it be done simply done with constructing a suffix tree or something similar. If yes, exactly how? and what should I do?
anyone knows of any c/c++ source code to accomplish the task?
thanks in advance.
to clarify the construction of the tree suggested in the answers. Does the tree looks like this?
o
/ \
o o
/ \ / \
o o o o
/ /
o o
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
@Arash 和 @Sanchit Gupta:您可能对 LZ76 复杂性和 LZ78 复杂性感到困惑。 Arash 指的是 LZ76 复杂度,另一种是 LZ78 复杂度。您可以参考论文“Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity”的第3节。
@Arash and @Sanchit Gupta: You might've got confused between LZ76 complexity and LZ78 complexity. The one Arash is refering to is LZ76 complexity and the other one is LZ78 complexity. You can refer to section-3 of the paper "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity".
下面是如何使用树计算 LZ 复杂度的快速示例。为了方便起见 - 我的;不是你的 - 这段代码实现了一个固定大小的预分配树,并且是为什么 void* 指针难以使用且难以维护的一个主要示例。按原样交出这段代码,你的讲师可能会朝你的脸开枪:)
Below is a quick example of how to compute LZ-Complexity using a tree. For convenience - mine; not yours - this code implements a fixed-sized pre-allocated tree, and is a prime example of why void* pointers are ugly to use and difficult to maintain. Hand this code in as is, and your lecturer is likely to shoot you in the face :)
1 0 01 11 10 110 00 010
序列的复杂度为 8,因为分区是 8 而不是 6 - 1/0/01/11/10/110/00/010
1 0 01 11 10 110 00 010
Complexity of sequence is 8 because the partitions are 8 not 6 - 1/0/01/11/10/110/00/010
Guillermo Valle 有一个实现,可以生成正确的答案(与当前的维基百科代码不同)。
例如,
0001 的复杂度为 2:
0 001
010 的复杂度为 3:
0 1 0
Guillermo Valle has an implementation that produces the right answers (unlike e.g. the current Wikipedia code).
For instance,
Complexity of 0001 is 2:
0 001
Complexity of 010 is 3:
0 1 0
创建一棵二叉树,其中左为 0,右为 1。对于每一位,尝试查找树中的序列。如果存在,连接下一点,冲洗,重复。如果不存在,请将其添加到树中并继续。 LZ 复杂度是树中路径的总数(不仅仅是叶节点数)。
顺便问一下,这是
作业
吗?Create a binary tree where left is 0 and right is 1. For each bit, try to find the sequence in the tree. If it's there, concatenate the next bit, rinse, repeat. If it's not there, add it to the tree and continue. LZ Complexity is the total number of paths in the tree (not just # leaf nodes).
By the way, is this
homework
?这可能与您相关。是并行实现
计算 LZ 复杂度的 LZMP 算法
在 CUDA 中并在 nVidia GPU 上运行。
http://www.ariel.ac.il/sites/ratsaby/代码/LZMP.zip
This may be relevant for you. It is parallel implementation
of an algorithm LZMP that computes the LZ-complexity
in CUDA and runs on nVidia GPU.
http://www.ariel.ac.il/sites/ratsaby/Code/LZMP.zip
这在 Python 中应该可以解决问题(来自:Kaspar、F. Schuster、H.时空模式复杂性的轻松计算测量。物理评论 A,第 36 卷,第 2 期,第 842 页。)
This should do the trick in Python (from: Kaspar, F. Schuster, H. Easily calculable measure for the complexity of spatiotemporal patterns. Physical Review A, vol 36, n. 2, p 842.)