编辑距离成本

发布于 2024-11-24 14:04:03 字数 2172 浏览 1 评论 0原文

我是 haskell 的新手,遇到了一个非常严重的性能问题,它一定是我的代码而不是 haskell 平台。

我有一个 Levenshtein 距离的 python 实现(自己的代码),并且我将其传递(或尝试这样做)给 haskell。结果如下:

bool2int :: Bool -> Int
bool2int True = 1
bool2int False = 0

levenshtein :: Eq a => [a] -> [a] -> Int -> Int -> Int
levenshtein u v 0 0 = 0
levenshtein u v i 0 = i
levenshtein u v 0 j = j
levenshtein u v i j = minimum [1 + levenshtein u v i (j - 1),
    1 + levenshtein u v (i - 1) j,
    bool2int (u !! (i - 1) /= v !! (j - 1) ) + levenshtein u v (i - 1) (j - 1) ]

distance :: Eq a => [a] -> [a] -> Int
distance u v = levenshtein u v (length u) (length v)

现在,Python 和 Haskell 之间长度为 10 或以上的字符串的执行时间差异为 10 的不同次方。另外,通过一些粗略的时间测量(挂钟,因为到目前为止我还没有在 haskell 中找到 Clock() 命令),看来我的 haskell 实现成本不是 O(mn),而是其他一些快速增长的成本。

注意:我不希望我的 haskell 实现与 python 脚本竞争速度。我只是希望它在“合理”的时间内运行,而不是整个宇宙存在的时间的倍数。

问题:

  • 我做错了什么,我的实施速度这么慢?
  • 如何修复它?
  • 谈论“惰性求值”:我认为如果 levenshtein "cat" "kit" 2 2 被调用三次,则只计算一次。这是对的吗?
  • 我的 bool2int 一定有内置的东西,对吗?
  • 任何其他输入如果能推动我在掌握 Haskell 的艰难道路上前进,我都会非常感激。

编辑:这里是用于比较的Python代码:

#! /usr/bin/python3.2
# -*- coding, utf-8 -*-

class Levenshtein:
        def __init__ (self, u, v):
                self.__u = ' ' + u
                self.__v = ' ' + v
                self.__D = [ [None for x in self.__u] for x in self.__v]
                for x, _ in enumerate (self.__u): self.__D [0] [x] = x
                for x, _ in enumerate (self.__v): self.__D [x] [0] = x

        @property
        def distance (self):
                return self.__getD (len (self.__v) - 1, len (self.__u) - 1)

        def __getD (self, i, j):
                if self.__D [i] [j] != None: return self.__D [i] [j]
                self.__D [i] [j] = min ( [self.__getD (i - 1, j - 1) + (0 if self.__v [i] == self.__u [j] else 1),
                                  self.__getD (i, j - 1) + 1,
                                  self.__getD (i - 1, j) + 1] )
                return self.__D [i] [j]

print (Levenshtein ('first string', 'second string').distance)

I am new to haskell and I encountered a performance issue that is so grave that it must be my code and not the haskell platform.

I have a python implementation of the Levenshtein distance (own code) and I passed (or tried to do so) this to haskell. The result is the following:

bool2int :: Bool -> Int
bool2int True = 1
bool2int False = 0

levenshtein :: Eq a => [a] -> [a] -> Int -> Int -> Int
levenshtein u v 0 0 = 0
levenshtein u v i 0 = i
levenshtein u v 0 j = j
levenshtein u v i j = minimum [1 + levenshtein u v i (j - 1),
    1 + levenshtein u v (i - 1) j,
    bool2int (u !! (i - 1) /= v !! (j - 1) ) + levenshtein u v (i - 1) (j - 1) ]

distance :: Eq a => [a] -> [a] -> Int
distance u v = levenshtein u v (length u) (length v)

Now, the difference in execution time for strings of length 10 or more is of various powers of 10 between python and haskell. Also with some rough time measuring (wall clock, as I haven't found a clock() command in haskell so far) it seems that my haskell implementation has not cost O(mn), but some other exorbitantly fast growing cost.

Nota bene: I do not want my haskell implementation to compete speed wise with the python script. I just want it to run in a "sensible" time and not in multiples of the time the whole universe exists.

Questions:

  • What am I doing wrong, that my implementation is so darn slow?
  • How to fix it?
  • Talking about "lazy evaluation": I gather that if levenshtein "cat" "kit" 2 2 is called thrice, it is only calculated once. Is this right?
  • There must be something built-in for my bool2int, right?
  • Any other input is highly appreciated if it shoves me ahead on the rough path to mastering haskell.

EDIT: Here goes the python code for comparison:

#! /usr/bin/python3.2
# -*- coding, utf-8 -*-

class Levenshtein:
        def __init__ (self, u, v):
                self.__u = ' ' + u
                self.__v = ' ' + v
                self.__D = [ [None for x in self.__u] for x in self.__v]
                for x, _ in enumerate (self.__u): self.__D [0] [x] = x
                for x, _ in enumerate (self.__v): self.__D [x] [0] = x

        @property
        def distance (self):
                return self.__getD (len (self.__v) - 1, len (self.__u) - 1)

        def __getD (self, i, j):
                if self.__D [i] [j] != None: return self.__D [i] [j]
                self.__D [i] [j] = min ( [self.__getD (i - 1, j - 1) + (0 if self.__v [i] == self.__u [j] else 1),
                                  self.__getD (i, j - 1) + 1,
                                  self.__getD (i - 1, j) + 1] )
                return self.__D [i] [j]

print (Levenshtein ('first string', 'second string').distance)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

傲性难收 2024-12-01 14:04:03

我做错了什么,我的实现速度这么慢?

您的算法具有指数复杂度。您似乎假设正在为您记住这些电话,但事实并非如此。

如何解决?

您需要添加显式记忆,可能使用数组或其他方法。

谈论“惰性求值”:我认为如果 levenshtein "cat" "kit" 2 2 被调用三次,则只计算一次。这是正确的吗?


不,Haskell 不进行自动记忆。惰性意味着,如果您执行 let y = fx in y + y,则仅在需要求和结果时才对 f x 求值(一次)。它意味着fx + f x将仅在对f x的一次调用中求值。当您想要共享子表达式的结果时,您必须明确。

我的 bool2int 一定有内置的东西,对吗?

是的,有一个instance Enum Bool,因此您可以使用fromEnum

*Main> fromEnum True
1
*Main> fromEnum False
0

如果任何其他意见能够推动我在掌握 Haskell 的艰难道路上前进,我们将非常感激。

虽然从头开始编写东西可能很有趣且具有教育意义,但学习利用 Hackage< 上的许多优秀库非常重要/a> 当做这样的常见事情时。

例如,edit-distance 包中有 Levenshtein 距离的实现。


我将你的 Haskell 代码翻译回 Python 进行比较:

def levenshtein(u, v, i, j):
    if i == 0: return j
    if j == 0: return i

    return min(1 + levenshtein(u, v, i, (j-1)),
               1 + levenshtein(u, v, (i-1), j),
               (u[i-1] != v[j-1]) + levenshtein(u, v, (i-1), (j-1)))

def distance(u, v):
    return levenshtein(u, v, len(u), len(v))

if __name__ == "__main__":
    print distance("abbacabaab", "abaddafaca")

即使没有修复 的 O(n) 索引问题chrisdb 在他的回答中指出,这在编译时比 Haskell 版本慢:

$ time python levenshtein.py
6

real    0m4.793s
user    0m4.690s
sys 0m0.020s

$ time ./Levenshtein 
6

real    0m0.524s
user    0m0.520s
sys 0m0.000s

当然,它们都输给了正确记忆的版本edit-distance 包:

$ time ./LevenshteinEditDistance 
6

real    0m0.015s
user    0m0.010s
sys 0m0.000s

这是一个使用 Data.Array 的简单记忆实现:

import Data.Array

distance u v = table ! (m, n)
   where table = listArray ((0, 0), (m, n)) [levenshtein i j | i <- [0..m], j <- [0..n]]

         levenshtein 0 j = j
         levenshtein i 0 = i
         levenshtein i j = minimum [ 1 + table!(i, j-1)
                                   , 1 + table!(i-1, j)
                                   , fromEnum (u'!(i-1) /= v'!(j-1)) + table!(i-1, j-1) ]

         u' = listArray (0, m-1) u
         v' = listArray (0, n-1) v

         m = length u
         n = length v

main = print $ distance "abbacabaab" "abaddafaca"

它的执行方式与原始 Python 代码类似:

$ time python levenshtein-original.py 
6

real    0m0.037s
user    0m0.030s
sys 0m0.000s
$ time ./LevenshteinArray 
6

real    0m0.017s
user    0m0.010s
sys 0m0.000s

What am I doing wrong, that my implementation is so darn slow?

Your algorithm has exponential complexity. You seem to be assuming that the calls are being memoized for you, but that's not the case.

How to fix it?

You'll need to add explicit memoization, possibly using an array or some other method.

Talking about "lazy evaluation": I gather that if levenshtein "cat" "kit" 2 2 is called thrice, it is only calculated once. Is this right?

No, Haskell does not do automatic memoization. Laziness means that if you do let y = f x in y + y, then the f x will only be evaluated (once) if the result of the sum is demanded. It does not mean that f x + f x will evaluate in only one call to f x. You have to be explicit when you want to share results from subexpressions.

There must be something built-in for my bool2int, right?

Yes, there is an instance Enum Bool, so you can use fromEnum.

*Main> fromEnum True
1
*Main> fromEnum False
0

Any other input is highly appreciated if it shoves me ahead on the rough path to mastering haskell.

While writing stuff from scratch may be fun and educational, it is important to learn to take advantage of the many great libraries on Hackage when doing common things like this.

For example there is an implementation of the Levenshtein distance in the edit-distance package.


I translated your Haskell code back to Python for comparison:

def levenshtein(u, v, i, j):
    if i == 0: return j
    if j == 0: return i

    return min(1 + levenshtein(u, v, i, (j-1)),
               1 + levenshtein(u, v, (i-1), j),
               (u[i-1] != v[j-1]) + levenshtein(u, v, (i-1), (j-1)))

def distance(u, v):
    return levenshtein(u, v, len(u), len(v))

if __name__ == "__main__":
    print distance("abbacabaab", "abaddafaca")

Even without fixing the O(n) indexing issue that chrisdb pointed out in his answer, this performs slower than the Haskell version when compiled:

$ time python levenshtein.py
6

real    0m4.793s
user    0m4.690s
sys 0m0.020s

$ time ./Levenshtein 
6

real    0m0.524s
user    0m0.520s
sys 0m0.000s

Of course, they both lose to the properly memoized version in the edit-distance package:

$ time ./LevenshteinEditDistance 
6

real    0m0.015s
user    0m0.010s
sys 0m0.000s

Here's a simple memoized implementation using Data.Array:

import Data.Array

distance u v = table ! (m, n)
   where table = listArray ((0, 0), (m, n)) [levenshtein i j | i <- [0..m], j <- [0..n]]

         levenshtein 0 j = j
         levenshtein i 0 = i
         levenshtein i j = minimum [ 1 + table!(i, j-1)
                                   , 1 + table!(i-1, j)
                                   , fromEnum (u'!(i-1) /= v'!(j-1)) + table!(i-1, j-1) ]

         u' = listArray (0, m-1) u
         v' = listArray (0, n-1) v

         m = length u
         n = length v

main = print $ distance "abbacabaab" "abaddafaca"

It performs similarly to your original Python code:

$ time python levenshtein-original.py 
6

real    0m0.037s
user    0m0.030s
sys 0m0.000s
$ time ./LevenshteinArray 
6

real    0m0.017s
user    0m0.010s
sys 0m0.000s
GRAY°灰色天空 2024-12-01 14:04:03

在我看来,可能的原因是使用 !!用于随机访问。 Haskell 列表是链表,不太适合需要随机访问而不是顺序访问的算法。

您可能想尝试用更适合随机访问的内容替换列表。如果您对字符串感兴趣,您可以使用 Data.Text 或 ByteStrings,它们本质上是数组,并且应该很快。或者你可以使用类似 Data.Vector 的东西。

编辑:实际上,看起来 Data.Text 也会有同样的问题,因为文档说索引是 O(n) 。可能转换为向量是最好的。

It looks to me like the likely cause is the use of !! for random access. Haskell lists are linked lists, which are poorly suited to algorithms that require random rather than sequential access.

You might want to try replacing the lists with something better suited to random access. If you are interested in strings you could use Data.Text or ByteStrings, which are underlyingly arrays and should be fast. Or you could use something like Data.Vector perhaps.

EDIT: Actually, it looks like Data.Text will have the same issue, since the documentation says indexing is O(n). Probably converting to a vector would be best.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文