计算二进制字符串的 Lempel-Ziv (LZ) 复杂度(也称为序列复杂度)

发布于 2024-10-17 01:22:09 字数 467 浏览 5 评论 0原文

我需要计算二进制字符串的 LZ 复杂度。 LZ 复杂度是从开始到结束查看流时遇到的差异子串的数量。例如:

s = 1001111011000010

在不同子串中标记序列复杂度 c(s) = 6: s = 1 / 0 / 01 / 1110 / 1100 / 0010 /

有人可以指导我找到一个简单的解决方案吗?我确信对于这个众所周知的问题应该有一些非常直接的实现,但我很难找到它们。可以简单地通过构建后缀树或类似的东西来完成吗?如果是,具体如何?我该怎么办?

有人知道任何 c/c++ 源代码来完成这项任务吗?

提前致谢。

澄清答案中建议的树的构造。这棵树看起来像这样吗?

         o
       /   \
      o     o
     / \   / \
    o   o o   o
       /     /
      o     o

I need to calculate the LZ-complexity of a binary string. The LZ-complexity is the number of differencet substrings encountered as the stream is viewed from begining to the end. As an example:

s = 1001111011000010

Marking in the different substrings the sequence complexity c(s) = 6:
s = 1 / 0 / 01 / 1110 / 1100 / 0010 /

can someone guide me to find a simple solution for that? I am sure there should be some very straight-forward implementations for this well-known problem, but I have difficulty finding them. Can it be done simply done with constructing a suffix tree or something similar. If yes, exactly how? and what should I do?

anyone knows of any c/c++ source code to accomplish the task?

thanks in advance.

to clarify the construction of the tree suggested in the answers. Does the tree looks like this?

         o
       /   \
      o     o
     / \   / \
    o   o o   o
       /     /
      o     o

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

南风起 2024-10-24 01:22:09

@Arash 和 @Sanchit Gupta:您可能对 LZ76 复杂性和 LZ78 复杂性感到困惑。 Arash 指的是 LZ76 复杂度,另一种是 LZ78 复杂度。您可以参考论文“Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity”的第3节。

@Arash and @Sanchit Gupta: You might've got confused between LZ76 complexity and LZ78 complexity. The one Arash is refering to is LZ76 complexity and the other one is LZ78 complexity. You can refer to section-3 of the paper "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity".

魂ガ小子 2024-10-24 01:22:09

下面是如何使用树计算 LZ 复杂度的快速示例。为了方便起见 - 我的;不是你的 - 这段代码实现了一个固定大小的预分配树,并且是为什么 void* 指针难以使用且难以维护的一个主要示例。按原样交出这段代码,你的讲师可能会朝你的脸开枪:)

#include <stdlib.h>
#include <stdio.h>

int LZComplexity(char *p_binarySequence, int p_maxTreeNodes)
{
 void **patternTree;
 void **currentNode;
 void **nextFreeNode;
 int nodeCount;
 int sequenceIndex;
 int currentDigit;

 nodeCount = 0;
 patternTree = malloc(sizeof(void*) * (p_maxTreeNodes << 1));
 currentNode = patternTree;
 nextFreeNode = patternTree + (sizeof(void*) << 1);
 currentNode[0] = NULL;
 currentNode[1] = NULL;
 sequenceIndex = 0;

 while (p_binarySequence[sequenceIndex])
 {
  currentDigit = p_binarySequence[sequenceIndex] - 48;
  if (NULL == currentNode[currentDigit])
  {
   currentNode[currentDigit] = nextFreeNode;
   nextFreeNode[0] = NULL;
   nextFreeNode[1] = NULL;
   nextFreeNode += (sizeof(void*) << 1);
   currentNode = patternTree;
   nodeCount++;
  }
  else
  {
   currentNode = currentNode[currentDigit];
  }
  sequenceIndex++;
 }

 free(patternTree);
 return nodeCount;
}


int main(int argc, char *argv[])
{
 printf("%u\n", LZComplexity("10100101001011101011", 1000));
 return 0;
}

Below is a quick example of how to compute LZ-Complexity using a tree. For convenience - mine; not yours - this code implements a fixed-sized pre-allocated tree, and is a prime example of why void* pointers are ugly to use and difficult to maintain. Hand this code in as is, and your lecturer is likely to shoot you in the face :)

#include <stdlib.h>
#include <stdio.h>

int LZComplexity(char *p_binarySequence, int p_maxTreeNodes)
{
 void **patternTree;
 void **currentNode;
 void **nextFreeNode;
 int nodeCount;
 int sequenceIndex;
 int currentDigit;

 nodeCount = 0;
 patternTree = malloc(sizeof(void*) * (p_maxTreeNodes << 1));
 currentNode = patternTree;
 nextFreeNode = patternTree + (sizeof(void*) << 1);
 currentNode[0] = NULL;
 currentNode[1] = NULL;
 sequenceIndex = 0;

 while (p_binarySequence[sequenceIndex])
 {
  currentDigit = p_binarySequence[sequenceIndex] - 48;
  if (NULL == currentNode[currentDigit])
  {
   currentNode[currentDigit] = nextFreeNode;
   nextFreeNode[0] = NULL;
   nextFreeNode[1] = NULL;
   nextFreeNode += (sizeof(void*) << 1);
   currentNode = patternTree;
   nodeCount++;
  }
  else
  {
   currentNode = currentNode[currentDigit];
  }
  sequenceIndex++;
 }

 free(patternTree);
 return nodeCount;
}


int main(int argc, char *argv[])
{
 printf("%u\n", LZComplexity("10100101001011101011", 1000));
 return 0;
}
皇甫轩 2024-10-24 01:22:09

1 0 01 11 10 110 00 010
序列的复杂度为 8,因为分区是 8 而不是 6 - 1/0/01/11/10/110/00/010

1 0 01 11 10 110 00 010
Complexity of sequence is 8 because the partitions are 8 not 6 - 1/0/01/11/10/110/00/010

呆萌少年 2024-10-24 01:22:09

Guillermo Valle 有一个实现,可以生成正确的答案(与当前的维基百科代码不同)。

例如,

0001 的复杂度为 2:0 001

010 的复杂度为 3:0 1 0

Guillermo Valle has an implementation that produces the right answers (unlike e.g. the current Wikipedia code).

For instance,

Complexity of 0001 is 2: 0 001

Complexity of 010 is 3: 0 1 0

单身狗的梦 2024-10-24 01:22:09

创建一棵二叉树,其中左为 0,右为 1。对于每一位,尝试查找树中的序列。如果存在,连接下一点,冲洗,重复。如果不存在,请将其添加到树中并继续。 LZ 复杂度是树中路径的总数(不仅仅是叶节点数)。

顺便问一下,这是作业吗?

Create a binary tree where left is 0 and right is 1. For each bit, try to find the sequence in the tree. If it's there, concatenate the next bit, rinse, repeat. If it's not there, add it to the tree and continue. LZ Complexity is the total number of paths in the tree (not just # leaf nodes).

By the way, is this homework?

谈下烟灰 2024-10-24 01:22:09

这可能与您相关。是并行实现
计算 LZ 复杂度的 LZMP 算法
在 CUDA 中并在 nVidia GPU 上运行。

http://www.ariel.ac.il/sites/ratsaby/代码/LZMP.zip

This may be relevant for you. It is parallel implementation
of an algorithm LZMP that computes the LZ-complexity
in CUDA and runs on nVidia GPU.

http://www.ariel.ac.il/sites/ratsaby/Code/LZMP.zip

掩于岁月 2024-10-24 01:22:09

这在 Python 中应该可以解决问题(来自:Kaspar、F. Schuster、H.时空模式复杂性的轻松计算测量。物理评论 A,第 36 卷,第 2 期,第 842 页。)

#!/usr/bin/python


def lz_complexity(s):
    i, k, l = 0, 1, 1
    k_max = 1
    n = len(s) - 1
    c = 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k = k + 1
            if l + k >= n - 1:
                c = c + 1
                break
        else:
            if k > k_max:
               k_max = k
            i = i + 1
            if i == l:
                c = c + 1
                l = l + k_max
                if l + 1 > n:
                    break
                else:
                    i = 0
                    k = 1
                    k_max = 1
            else:
                k = 1
    return c


def main():
    lz = lz_complexity('1001111011000010')
    assert lz == 6 
    print lz


if __name__ == '__main__':

    main()

This should do the trick in Python (from: Kaspar, F. Schuster, H. Easily calculable measure for the complexity of spatiotemporal patterns. Physical Review A, vol 36, n. 2, p 842.)

#!/usr/bin/python


def lz_complexity(s):
    i, k, l = 0, 1, 1
    k_max = 1
    n = len(s) - 1
    c = 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k = k + 1
            if l + k >= n - 1:
                c = c + 1
                break
        else:
            if k > k_max:
               k_max = k
            i = i + 1
            if i == l:
                c = c + 1
                l = l + k_max
                if l + 1 > n:
                    break
                else:
                    i = 0
                    k = 1
                    k_max = 1
            else:
                k = 1
    return c


def main():
    lz = lz_complexity('1001111011000010')
    assert lz == 6 
    print lz


if __name__ == '__main__':

    main()
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文