修改 Levenshtein Distance 算法以不计算所有距离

发布于 2024-09-26 04:36:00 字数 2926 浏览 6 评论 0原文

我正在研究模糊搜索实现,作为实现的一部分,我们使用 Apache 的 StringUtils.getLevenshteinDistance。目前,我们正在为模糊搜索设定一个特定的最大平均响应时间。经过各种增强和一些分析后,花费最多时间的地方是计算编辑距离。在搜索三个字母或更多字母的字符串时,它大约占据总时间的 80-90%。

现在,我知道这里可以做的事情有一些限制,但我已经阅读了之前的 SO 问题和 LD 的维基百科链接,如果有人愿意将阈值限制为设定的最大距离,这可能有助于遏制在算法上花费了时间,但我不确定如何准确地做到这一点。

如果我们只对 距离如果小于a 阈值k,那么就足以 计算宽度的对角线条纹 矩阵中有2k+1。这样, 算法可以在 O(kl) 时间内运行, 其中 l 是最短的长度 字符串。[3]

下面您将看到来自 StringUtils 的原始 LH 代码。之后就是我的修改了。我试图基本上计算距 i,j 对角线的设定长度的距离(因此,在我的示例中,i,j 对角线上方和下方的两条对角线)。然而,这不可能是正确的,因为我已经这样做了。例如,在最高对角线上,它总是会选择正上方的单元格值,该值将为 0。如果有人可以向我展示如何按照我所描述的方式实现此功能,或者关于如何实现它的一些一般建议,将不胜感激。

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

我的修改(仅针对 for 循环):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

I'm working on a fuzzy search implementation and as part of the implementation, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're going for a specific maxmimum average response time for our fuzzy search. After various enhancements and with some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings three letters or more.

Now, I know there are some limitations to what can be done here, but I've read on previous SO questions and on the Wikipedia link for LD that if one is willing limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm, but I'm not sure how to do this exactly.

If we are only interested in the
distance if it is smaller than a
threshold k, then it suffices to
compute a diagonal stripe of width
2k+1 in the matrix. In this way, the
algorithm can be run in O(kl) time,
where l is the length of the shortest
string.[3]

Below you will see the original LH code from StringUtils. After that is my modification. I'm trying to basically calculate the distances of a set length from the i,j diagonal (so, in my example, two diagonals above and below the i,j diagonal). However, this can't be correct as I've done it. For example, on the highest diagonal, it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this functional as I've described, or some general advice on how to make it so, it would be greatly appreciated.

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

My modifications (only to the for loops):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

无语# 2024-10-03 04:36:00

实现窗口的问题是处理每行中第一个条目左侧和最后一个条目上方的值。

一种方法是将您最初填写的值从 1 而不是 0 开始,然后忽略遇到的任何 0。您必须从最终答案中减去 1。

另一种方法是用高值填充第一个和最后一个上方的条目,这样最小检查永远不会选择它们。这就是我前几天必须实现它时选择的方式:

public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();

    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if(tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
    }

    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;

    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the tailing member of the threshold 
    // window in the bottom array will run min across them
    int n = 0;
    for(; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're 
    // looking for a match within a set distance
    for(int row = 1; row < s.length()+1; ++row) {
        char schar = s.charAt(row-1);
        d[0] = row;

        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);

        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if(min > 1)
            d[min-1] = Integer.MAX_VALUE;

        for(int col = min; col < max; ++col) {
            if(schar == t.charAt(col-1))
                d[col] = p[col-1];
            else 
                // min of: diagonal, left, up
                d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }

        if(p[tlen] == Integer.MAX_VALUE)
            return -1;
    return p[tlen];
}

The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.

One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.

Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:

public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();

    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if(tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
    }

    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;

    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the tailing member of the threshold 
    // window in the bottom array will run min across them
    int n = 0;
    for(; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're 
    // looking for a match within a set distance
    for(int row = 1; row < s.length()+1; ++row) {
        char schar = s.charAt(row-1);
        d[0] = row;

        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);

        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if(min > 1)
            d[min-1] = Integer.MAX_VALUE;

        for(int col = min; col < max; ++col) {
            if(schar == t.charAt(col-1))
                d[col] = p[col-1];
            else 
                // min of: diagonal, left, up
                d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }

        if(p[tlen] == Integer.MAX_VALUE)
            return -1;
    return p[tlen];
}
神妖 2024-10-03 04:36:00

我写过关于 Levenshtein 自动机的文章,这是之前在 O(n) 时间内进行此类检查的一种方法,此处。源代码示例采用 Python 编写,但解释应该会有所帮助,并且引用的论文提供了更多详细信息。

I've written about Levenshtein automata, which are one way to do this sort of check in O(n) time before, here. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.

蹲在坟头点根烟 2024-10-03 04:36:00

根据“Gusfield, Dan (1997)。关于字符串、树和序列的算法:计算机科学和计算生物学”(第 264 页),您应该忽略零。

According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.

不语却知心 2024-10-03 04:36:00

这里有人回答了一个非常相似的问题:

引用:
<我>
我已经做过很多次了。我的方法是对可能变化的游戏树进行递归深度优先树遍历。有一个预算 k 的更改,我用它来修剪树。有了这个例程,我首先使用 k=0 运行它,然后 k=1,然后 k=2,直到我得到命中或者我不想再更高。

char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */ 
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}  

只是主要部分,更多代码在原文中

Here someone answers a very similar question:

Cite:

I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.

char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */ 
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}  

just the main part, more code in the original

久而酒知 2024-10-03 04:36:00

我使用了原始代码并将其放置在 j for 循环结束之前:

    if (p[n] > s.length() + 5)
        break;

+5 是任意的,但对于我们的目的来说,如果距离是查询长度加五(或我们确定的任何数字),则它不会返回的内容确实很重要,因为我们认为匹配结果差异太大。它确实减少了一些事情。不过,如果有人能更好地理解的话,可以肯定这不是维基声明所讨论的想法。

I used the original code and places this just before the end of the j for loop:

    if (p[n] > s.length() + 5)
        break;

The +5 is arbitrary but for our purposes, if the distances is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned because we consider the match as simply being too different. It does cut down on things a bit. Still, pretty sure this isn't the idea that the Wiki statement was talking about, if anyone understands that better.

無心 2024-10-03 04:36:00

Apache Commons Lang 3.4 有这样的实现:

/**
 * <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
 * threshold.</p>
 *
 * <p>This is the number of changes needed to change one String into
 * another, where each change is a single character modification (deletion,
 * insertion or substitution).</p>
 *
 * <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
 * and Chas Emerick's implementation of the Levenshtein distance algorithm from
 * <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
 *
 * <pre>
 * StringUtils.getLevenshteinDistance(null, *, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, null, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, *, -1)               = IllegalArgumentException
 * StringUtils.getLevenshteinDistance("","", 0)               = 0
 * StringUtils.getLevenshteinDistance("aaapppp", "", 8)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 7)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 6))      = -1
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
 * </pre>
 *
 * @param s  the first String, must not be null
 * @param t  the second String, must not be null
 * @param threshold the target threshold, must not be negative
 * @return result distance, or {@code -1} if the distance would be greater than the threshold
 * @throws IllegalArgumentException if either String input {@code null} or negative threshold
 */
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    if (threshold < 0) {
        throw new IllegalArgumentException("Threshold must not be negative");
    }

    /*
    This implementation only computes the distance if it's less than or equal to the
    threshold value, returning -1 if it's greater.  The advantage is performance: unbounded
    distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
    computing a diagonal stripe of width 2k + 1 of the cost table.
    It is also possible to use this to compute the unbounded Levenshtein distance by starting
    the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
    d is the distance.

    One subtlety comes from needing to ignore entries on the border of our stripe
    eg.
    p[] = |#|#|#|*
    d[] =  *|#|#|#|
    We must ignore the entry to the left of the leftmost member
    We must ignore the entry above the rightmost member

    Another subtlety comes from our stripe running off the matrix if the strings aren't
    of the same size.  Since string s is always swapped to be the shorter of the two,
    the stripe will always run off to the upper right instead of the lower left of the matrix.

    As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
    In this case we're going to walk a stripe of length 3.  The matrix would look like so:

       1 2 3 4 5
    1 |#|#| | | |
    2 |#|#|#| | |
    3 | |#|#|#| |
    4 | | |#|#|#|
    5 | | | |#|#|
    6 | | | | |#|
    7 | | | | | |

    Note how the stripe leads off the table as there is no possible way to turn a string of length 5
    into one of length 7 in edit distance of 1.

    Additionally, this implementation decreases memory usage by using two
    single-dimensional arrays and swapping them back and forth instead of allocating
    an entire n by m matrix.  This requires a few minor changes, such as immediately returning
    when it's detected that the stripe has run off the matrix and initially filling the arrays with
    large values so that entries we don't compute are ignored.

    See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
     */

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    // if one string is empty, the edit distance is necessarily the length of the other
    if (n == 0) {
        return m <= threshold ? m : -1;
    } else if (m == 0) {
        return n <= threshold ? n : -1;
    }

    if (n > m) {
        // swap the two strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n + 1]; // 'previous' cost array, horizontally
    int d[] = new int[n + 1]; // cost array, horizontally
    int _d[]; // placeholder to assist in swapping p and d

    // fill in starting table values
    final int boundary = Math.min(n, threshold) + 1;
    for (int i = 0; i < boundary; i++) {
        p[i] = i;
    }
    // these fills ensure that the value above the rightmost entry of our
    // stripe will be ignored in following loop iterations
    Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // iterates through t
    for (int j = 1; j <= m; j++) {
        final char t_j = t.charAt(j - 1); // jth character of t
        d[0] = j;

        // compute stripe indices, constrain to array size
        final int min = Math.max(1, j - threshold);
        final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);

        // the stripe may lead off of the table if s and t are of different sizes
        if (min > max) {
            return -1;
        }

        // ignore entry left of leftmost
        if (min > 1) {
            d[min - 1] = Integer.MAX_VALUE;
        }

        // iterates through [min, max] in s
        for (int i = min; i <= max; i++) {
            if (s.charAt(i - 1) == t_j) {
                // diagonally left and up
                d[i] = p[i - 1];
            } else {
                // 1 + minimum of cell to the left, to the top, diagonally left and up
                d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
            }
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // if p[n] is greater than the threshold, there's no guarantee on it being the correct
    // distance
    if (p[n] <= threshold) {
        return p[n];
    }
    return -1;
}

Apache Commons Lang 3.4 has this implementation:

/**
 * <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
 * threshold.</p>
 *
 * <p>This is the number of changes needed to change one String into
 * another, where each change is a single character modification (deletion,
 * insertion or substitution).</p>
 *
 * <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
 * and Chas Emerick's implementation of the Levenshtein distance algorithm from
 * <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
 *
 * <pre>
 * StringUtils.getLevenshteinDistance(null, *, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, null, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, *, -1)               = IllegalArgumentException
 * StringUtils.getLevenshteinDistance("","", 0)               = 0
 * StringUtils.getLevenshteinDistance("aaapppp", "", 8)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 7)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 6))      = -1
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
 * </pre>
 *
 * @param s  the first String, must not be null
 * @param t  the second String, must not be null
 * @param threshold the target threshold, must not be negative
 * @return result distance, or {@code -1} if the distance would be greater than the threshold
 * @throws IllegalArgumentException if either String input {@code null} or negative threshold
 */
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    if (threshold < 0) {
        throw new IllegalArgumentException("Threshold must not be negative");
    }

    /*
    This implementation only computes the distance if it's less than or equal to the
    threshold value, returning -1 if it's greater.  The advantage is performance: unbounded
    distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
    computing a diagonal stripe of width 2k + 1 of the cost table.
    It is also possible to use this to compute the unbounded Levenshtein distance by starting
    the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
    d is the distance.

    One subtlety comes from needing to ignore entries on the border of our stripe
    eg.
    p[] = |#|#|#|*
    d[] =  *|#|#|#|
    We must ignore the entry to the left of the leftmost member
    We must ignore the entry above the rightmost member

    Another subtlety comes from our stripe running off the matrix if the strings aren't
    of the same size.  Since string s is always swapped to be the shorter of the two,
    the stripe will always run off to the upper right instead of the lower left of the matrix.

    As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
    In this case we're going to walk a stripe of length 3.  The matrix would look like so:

       1 2 3 4 5
    1 |#|#| | | |
    2 |#|#|#| | |
    3 | |#|#|#| |
    4 | | |#|#|#|
    5 | | | |#|#|
    6 | | | | |#|
    7 | | | | | |

    Note how the stripe leads off the table as there is no possible way to turn a string of length 5
    into one of length 7 in edit distance of 1.

    Additionally, this implementation decreases memory usage by using two
    single-dimensional arrays and swapping them back and forth instead of allocating
    an entire n by m matrix.  This requires a few minor changes, such as immediately returning
    when it's detected that the stripe has run off the matrix and initially filling the arrays with
    large values so that entries we don't compute are ignored.

    See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
     */

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    // if one string is empty, the edit distance is necessarily the length of the other
    if (n == 0) {
        return m <= threshold ? m : -1;
    } else if (m == 0) {
        return n <= threshold ? n : -1;
    }

    if (n > m) {
        // swap the two strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n + 1]; // 'previous' cost array, horizontally
    int d[] = new int[n + 1]; // cost array, horizontally
    int _d[]; // placeholder to assist in swapping p and d

    // fill in starting table values
    final int boundary = Math.min(n, threshold) + 1;
    for (int i = 0; i < boundary; i++) {
        p[i] = i;
    }
    // these fills ensure that the value above the rightmost entry of our
    // stripe will be ignored in following loop iterations
    Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // iterates through t
    for (int j = 1; j <= m; j++) {
        final char t_j = t.charAt(j - 1); // jth character of t
        d[0] = j;

        // compute stripe indices, constrain to array size
        final int min = Math.max(1, j - threshold);
        final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);

        // the stripe may lead off of the table if s and t are of different sizes
        if (min > max) {
            return -1;
        }

        // ignore entry left of leftmost
        if (min > 1) {
            d[min - 1] = Integer.MAX_VALUE;
        }

        // iterates through [min, max] in s
        for (int i = min; i <= max; i++) {
            if (s.charAt(i - 1) == t_j) {
                // diagonally left and up
                d[i] = p[i - 1];
            } else {
                // 1 + minimum of cell to the left, to the top, diagonally left and up
                d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
            }
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // if p[n] is greater than the threshold, there's no guarantee on it being the correct
    // distance
    if (p[n] <= threshold) {
        return p[n];
    }
    return -1;
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文