著名压缩算法（LZ78）的缓慢实现

发布于 2024-07-27 20:38:04 字数 1083 浏览 2 评论 0原文

我正在编写一种方法，通过遵循 LZ78 算法来近似字符串的 Kolmogorov 复杂度，除了不添加到表中，我只保留一个计数器，即我只对压缩的大小感兴趣。

问题是，对于大量输入，需要花费数小时。这是我实施的方式吗？

/**
 * Uses the LZ78 compression algorithm to approximate the Kolmogorov
 * complexity of a String
 * 
 * @param s
 * @return the approximate Kolmogorov complexity
 */
public double kComplexity(String s) {

    ArrayList<String> dictionary = new ArrayList<String>();
    StringBuilder w = new StringBuilder();
    double comp = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (dictionary.contains(w.toString() + c)) {
            w.append(c);
        } else {
            comp++;
            dictionary.add(w.toString() + c);
            w = new StringBuilder();
        }
    }
    if (w.length() != 0) {
        comp++;
    }

    return comp;
}

更新：使用

HashSet<String> dictionary = new HashSet<String>();

代替

ArrayList<String> dictionary = new ArrayList<String>();

使其更快

原文

I'm writing a method which approximates the Kolmogorov complexity of a String by following the LZ78 algorithm, except instead of adding to a table I just keep a counter i.e i'm only interested in the size of the compression.

The problem is that for large inputs it is taking hours. Is it the way I have implemented it?

/**
 * Uses the LZ78 compression algorithm to approximate the Kolmogorov
 * complexity of a String
 * 
 * @param s
 * @return the approximate Kolmogorov complexity
 */
public double kComplexity(String s) {

    ArrayList<String> dictionary = new ArrayList<String>();
    StringBuilder w = new StringBuilder();
    double comp = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (dictionary.contains(w.toString() + c)) {
            w.append(c);
        } else {
            comp++;
            dictionary.add(w.toString() + c);
            w = new StringBuilder();
        }
    }
    if (w.length() != 0) {
        comp++;
    }

    return comp;
}

UPDATE:
Using

HashSet<String> dictionary = new HashSet<String>();

instead of

ArrayList<String> dictionary = new ArrayList<String>();

makes it much faster

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

青春有你 2024-08-03 20:38:04

我想我可以做得更好（抱歉有点长）：

import java.io.File;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LZ78 {
    /**
     * Uses the LZ78 compression algorithm to approximate the Kolmogorov
     * complexity of a String
     * 
     * @param s
     * @return the approximate Kolmogorov complexity
     */
    public static double kComplexity(String s) {
        Set<String> dictionary = new HashSet<String>();
        StringBuilder w = new StringBuilder();
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (dictionary.contains(w.toString() + c)) {
                w.append(c);
            } else {
                comp++;
                dictionary.add(w.toString() + c);
                w = new StringBuilder();
            }
        }
        if (w.length() != 0) {
            comp++;
        }
        return comp;
    }

    private static boolean equal(Object o1, Object o2) {
        return o1 == o2 || (o1 != null && o1.equals(o2));
    }

    public static final class FList<T> {
        public final FList<T> head;
        public final T tail;
        private final int hashCodeValue;

        public FList(FList<T> head, T tail) {
            this.head = head;
            this.tail = tail;
            int v = head != null ? head.hashCodeValue : 0;
            hashCodeValue = v * 31 + (tail != null ? tail.hashCode() : 0);
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof FList<?>) {
                FList<?> that = (FList<?>) obj;
                return equal(this.head, that.head)
                        && equal(this.tail, that.tail);
            }
            return false;
        }

        @Override
        public int hashCode() {
            return hashCodeValue;
        }

        @Override
        public String toString() {
            return head + ", " + tail;
        }
    }

    public static final class FListChar {
        public final FListChar head;
        public final char tail;
        private final int hashCodeValue;

        public FListChar(FListChar head, char tail) {
            this.head = head;
            this.tail = tail;
            int v = head != null ? head.hashCodeValue : 0;
            hashCodeValue = v * 31 + tail;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof FListChar) {
                FListChar that = (FListChar) obj;
                return equal(this.head, that.head) && this.tail == that.tail;
            }
            return false;
        }

        @Override
        public int hashCode() {
            return hashCodeValue;
        }

        @Override
        public String toString() {
            return head + ", " + tail;
        }
    }

    public static double kComplexity2(String s) {
        Map<FList<Character>, FList<Character>> dictionary = 
            new HashMap<FList<Character>, FList<Character>>();
        FList<Character> w = null;
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            FList<Character> w1 = new FList<Character>(w, c);
            FList<Character> ex = dictionary.get(w1);
            if (ex != null) {
                w = ex;
            } else {
                comp++;
                dictionary.put(w1, w1);
                w = null;
            }
        }
        if (w != null) {
            comp++;
        }
        return comp;
    }

    public static double kComplexity3(String s) {
        Map<FListChar, FListChar> dictionary = 
            new HashMap<FListChar, FListChar>(1024);
        FListChar w = null;
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            FListChar w1 = new FListChar(w, c);
            FListChar ex = dictionary.get(w1);
            if (ex != null) {
                w = ex;
            } else {
                comp++;
                dictionary.put(w1, w1);
                w = null;
            }
        }
        if (w != null) {
            comp++;
        }
        return comp;
    }

    public static void main(String[] args) throws Exception {
        File f = new File("methods.txt");
        byte[] data = new byte[(int) f.length()];
        FileInputStream fin = new FileInputStream(f);
        int len = fin.read(data);
        fin.close();
        final String test = new String(data, 0, len);

        final int n = 100;
        ExecutorService exec = Executors.newFixedThreadPool(1);
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity(test);
                }
                System.out.printf("kComplexity: %.3f; Time: %d ms%n",
                        value / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity2(test);
                }
                System.out.printf("kComplexity2: %.3f; Time: %d ms%n", value
                        / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity3(test);
                }
                System.out.printf("kComplexity3: %.3f; Time: %d ms%n", value
                        / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.shutdown();
    }
}

结果：

kComplexity: 41546,000; Time: 17028 ms
kComplexity2: 41546,000; Time: 6555 ms
kComplexity3: 41546,000; Time: 5971 ms

编辑同侪压力：它是如何工作的？

弗兰基，不知道，这似乎是加快速度的好方法。我也必须弄清楚，所以我们开始吧。

据观察，原始代码进行了大量字符串追加，但是，用 LinkedList 替换它并没有帮助，因为在哈希表中查找集合存在持续的压力- 每次使用 hashCode() 和 equals() 时都需要遍历整个列表。

但我怎样才能确保代码不会执行不必要的操作呢？答案：不变性 - 如果你的类是不可变的，那么至少 hashCode 是不变的，因此可以预先计算。相等性检查也可以缩短 - 但在最坏的情况下，它仍然会遍历整个集合。

这很好，但是如何“修改”一个不可变的类。不，您不需要，每次需要不同的内容时，您都会创建一个新内容。但是，当您仔细查看字典的内容时，您会发现它包含冗余信息：[]a、[abc]d、[abc ]e，[abcd]f。那么为什么不将头部存储为指向先前看到的模式的指针，并为当前字符设置尾部呢？

确切地。使用不变性和反向引用可以节省空间和时间，甚至垃圾收集器也会喜欢你。我的解决方案的一个特点是，在 F# 和 Haskell 中，列表使用 head:[tail] 签名 - head 是元素类型，tail 是指向下一个集合的指针。在这个问题中，随着列表在尾部增长，需要相反的情况。

从这里开始，它只是一些进一步的优化 - 例如，使用具体的 char 作为尾部类型，以避免通用版本的常量 char 自动装箱。

我的解决方案的一个缺点是它在计算列表的方面时使用递归。对于相对较小的列表，这很好，但较长的列表可能需要您增加计算运行的线程堆栈大小。理论上，通过 Java 7 的尾部调用优化，我的代码可以以允许 JIT 执行优化的方式重写（或者已经这样了？很难说）。

I think I can do better (sorry a bit long):

import java.io.File;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LZ78 {
    /**
     * Uses the LZ78 compression algorithm to approximate the Kolmogorov
     * complexity of a String
     * 
     * @param s
     * @return the approximate Kolmogorov complexity
     */
    public static double kComplexity(String s) {
        Set<String> dictionary = new HashSet<String>();
        StringBuilder w = new StringBuilder();
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (dictionary.contains(w.toString() + c)) {
                w.append(c);
            } else {
                comp++;
                dictionary.add(w.toString() + c);
                w = new StringBuilder();
            }
        }
        if (w.length() != 0) {
            comp++;
        }
        return comp;
    }

    private static boolean equal(Object o1, Object o2) {
        return o1 == o2 || (o1 != null && o1.equals(o2));
    }

    public static final class FList<T> {
        public final FList<T> head;
        public final T tail;
        private final int hashCodeValue;

        public FList(FList<T> head, T tail) {
            this.head = head;
            this.tail = tail;
            int v = head != null ? head.hashCodeValue : 0;
            hashCodeValue = v * 31 + (tail != null ? tail.hashCode() : 0);
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof FList<?>) {
                FList<?> that = (FList<?>) obj;
                return equal(this.head, that.head)
                        && equal(this.tail, that.tail);
            }
            return false;
        }

        @Override
        public int hashCode() {
            return hashCodeValue;
        }

        @Override
        public String toString() {
            return head + ", " + tail;
        }
    }

    public static final class FListChar {
        public final FListChar head;
        public final char tail;
        private final int hashCodeValue;

        public FListChar(FListChar head, char tail) {
            this.head = head;
            this.tail = tail;
            int v = head != null ? head.hashCodeValue : 0;
            hashCodeValue = v * 31 + tail;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof FListChar) {
                FListChar that = (FListChar) obj;
                return equal(this.head, that.head) && this.tail == that.tail;
            }
            return false;
        }

        @Override
        public int hashCode() {
            return hashCodeValue;
        }

        @Override
        public String toString() {
            return head + ", " + tail;
        }
    }

    public static double kComplexity2(String s) {
        Map<FList<Character>, FList<Character>> dictionary = 
            new HashMap<FList<Character>, FList<Character>>();
        FList<Character> w = null;
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            FList<Character> w1 = new FList<Character>(w, c);
            FList<Character> ex = dictionary.get(w1);
            if (ex != null) {
                w = ex;
            } else {
                comp++;
                dictionary.put(w1, w1);
                w = null;
            }
        }
        if (w != null) {
            comp++;
        }
        return comp;
    }

    public static double kComplexity3(String s) {
        Map<FListChar, FListChar> dictionary = 
            new HashMap<FListChar, FListChar>(1024);
        FListChar w = null;
        double comp = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            FListChar w1 = new FListChar(w, c);
            FListChar ex = dictionary.get(w1);
            if (ex != null) {
                w = ex;
            } else {
                comp++;
                dictionary.put(w1, w1);
                w = null;
            }
        }
        if (w != null) {
            comp++;
        }
        return comp;
    }

    public static void main(String[] args) throws Exception {
        File f = new File("methods.txt");
        byte[] data = new byte[(int) f.length()];
        FileInputStream fin = new FileInputStream(f);
        int len = fin.read(data);
        fin.close();
        final String test = new String(data, 0, len);

        final int n = 100;
        ExecutorService exec = Executors.newFixedThreadPool(1);
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity(test);
                }
                System.out.printf("kComplexity: %.3f; Time: %d ms%n",
                        value / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity2(test);
                }
                System.out.printf("kComplexity2: %.3f; Time: %d ms%n", value
                        / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.submit(new Runnable() {
            @Override
            public void run() {
                long t = System.nanoTime();
                double value = 0;
                for (int i = 0; i < n; i++) {
                    value += kComplexity3(test);
                }
                System.out.printf("kComplexity3: %.3f; Time: %d ms%n", value
                        / n, (System.nanoTime() - t) / 1000000);
            }
        });
        exec.shutdown();
    }
}

Results:

kComplexity: 41546,000; Time: 17028 ms
kComplexity2: 41546,000; Time: 6555 ms
kComplexity3: 41546,000; Time: 5971 ms

Edit Peer pressure: How does it work?

Franky, have no idea, it just seemed to be a good way to speed up things. I have to figure it out too, so here we go.

It was an observation that the original code made lot of string appends, however, replacing it with a LinkedList<String> wouldn't help as there is a constant pressure for looking up collections in the hash-table - every time, the hashCode() and equals() are used it needs to traverse the entire list.

But how can I make sure the code doesn't perform this unnecessary? The answer: Immutability - if your class is immutable, then at least the hashCode is constant, therefore, can be pre-computed. The equality check can be shortcutted too - but in worst case scenario it will still traverse the entire collection.

That's nice, but then how do you 'modify' an immutable class. No you don't, you create a new one every time a different content is needed. However, when you look close at the contents of the dictionary, you'll recognize it contains redundant information: []a, [abc]d, [abc]e, [abcd]f. So why not just store the head as a pointer to a previously seen pattern and have a tail for the current character?

Exactly. Using immutability and back-references you can spare space and time, and even the garbage collector will love you too. One speciality of my solution is that in F# and Haskell, the list uses a head:[tail] signature - the head is the element type and the tail is a pointer to the next collection. In this question, the opposite was required as the lists grow at the tail side.

From here on its just some further optimization - for example, use a concrete char as the tail type to avoid constant char autoboxing of the generic version.

One drawback of my solution is that it utilizes recursion when calculating the aspects of the list. For relatively small list it's fine, but longer list could require you to increase the Thread stack size on which your computation runs. In theory, with Java 7's tail call optimization, my code can be rewritten in such a way that it allows the JIT to perform the optimization (or is it already so? Hard to tell).

回复收藏 0 原文

○闲身 2024-08-03 20:38:04

在我看来，ArrayList 并不是保存仅包含和添加的字典的最佳数据结构。

编辑

尝试使用 HashSet，它存储它的元素在哈希表中，是设置接口; 但是它不保证迭代的顺序

回复收藏 0 原文

衣神在巴黎 2024-08-03 20:38:04

ArrayList 的搜索复杂度为 O(N)。使用哈希表或字典等数据结构。

回复收藏 0 原文

白衬杉格子梦 2024-08-03 20:38:04

由于您总是检查 prefix+c 我认为一个好的数据结构可能是一棵树，其中每个子项都有其父级作为前缀：

           root
        /       |
       a        b
      /  |     /  | 
     an  ap   ba bo
         |         
         ape

另一种可能更简单的方法是使用排序列表，然后使用二分搜索来查找您所在的字符串看着。我仍然认为树方法会更快。

Since you are always checking for prefix+c I think a good data structure could be a tree where each child has its parent as prefix:

           root
        /       |
       a        b
      /  |     /  | 
     an  ap   ba bo
         |         
         ape

Another perhaps more simple approach is to use a sorted list and then use binary search to find the string you are looking at. I still think the tree approach will be faster though.

回复收藏 0 原文