如何纠正 Damerau-Levenshtein 实施中的错误？

发布于 2024-09-13 11:37:41 字数 23556 浏览 16 评论 0原文

我带着另一个较长的问题回来了。尝试过许多基于 Python 的 Damerau-Levenshtein 编辑距离实现，我终于找到了下面列出的作为 editdistance_reference()。它似乎提供了正确的结果并且似乎有一个有效的实施。

所以我开始将代码转换为 Cython。根据我的测试数据，参考方法设法提供结果进行 11,000 次比较（对于大约 12 个字母长的单词对），而 Cythonized 方法则超过每秒 200,000 次比较。遗憾的是，结果不正确：当您查看变量 thisrow 时我打印出来进行调试，我的版本已经充满了，无论我向它扔什么数据，而参考输出显示另一张图片。例如，针对 'world' 测试 'helo' 产生以下输出（ED 标记我的函数，EDR 是正确工作的参考）：

From editdistance():

#ED  A [0, 0, 0, 0, 0, 1]
#ED  B [1, 0, 0, 0, 0, 1]
#ED  B [1, 1, 0, 0, 0, 1]
#ED  B [1, 1, 1, 0, 0, 1]
#ED  B [1, 1, 1, 1, 0, 1]
#ED  B [1, 1, 1, 1, 1, 1]

#ED  A [0, 0, 0, 0, 0, 2]
#ED  B [1, 0, 0, 0, 0, 2]
#ED  B [1, 1, 0, 0, 0, 2]
#ED  B [1, 1, 1, 0, 0, 2]
#ED  B [1, 1, 1, 1, 0, 2]
#ED  B [1, 1, 1, 1, 1, 2]

#ED  A [0, 0, 0, 0, 0, 3]
#ED  B [1, 0, 0, 0, 0, 3]
#ED  B [1, 1, 0, 0, 0, 3]
#ED  B [1, 1, 1, 0, 0, 3]
#ED  B [1, 1, 1, 1, 0, 3]
#ED  B [1, 1, 1, 1, 1, 3]

#ED  A [0, 0, 0, 0, 0, 4]
#ED  B [1, 0, 0, 0, 0, 4]
#ED  B [1, 1, 0, 0, 0, 4]
#ED  B [1, 1, 1, 0, 0, 4]
#ED  B [1, 1, 1, 1, 0, 4]
#ED  B [1, 1, 1, 1, 1, 4]

from editdistance_reference( ）：

#EDR A [0, 0, 0, 0, 0, 1]
#EDR B [1, 0, 0, 0, 0, 1]
#EDR B [1, 2, 0, 0, 0, 1]
#EDR B [1, 2, 3, 0, 0, 1]
#EDR B [1, 2, 3, 4, 0, 1]
#EDR B [1, 2, 3, 4, 5, 1]

#EDR A [0, 0, 0, 0, 0, 2]
#EDR B [2, 0, 0, 0, 0, 2]
#EDR B [2, 2, 0, 0, 0, 2]
#EDR B [2, 2, 3, 0, 0, 2]
#EDR B [2, 2, 3, 4, 0, 2]
#EDR B [2, 2, 3, 4, 5, 2]

#EDR A [0, 0, 0, 0, 0, 3]
#EDR B [3, 0, 0, 0, 0, 3]
#EDR B [3, 3, 0, 0, 0, 3]
#EDR B [3, 3, 3, 0, 0, 3]
#EDR B [3, 3, 3, 3, 0, 3]
#EDR B [3, 3, 3, 3, 4, 3]

#EDR A [0, 0, 0, 0, 0, 4]
#EDR B [4, 0, 0, 0, 0, 4]
#EDR B [4, 4, 0, 0, 0, 4]
#EDR B [4, 4, 4, 0, 0, 4]
#EDR B [4, 4, 4, 4, 0, 4]
#EDR B [4, 4, 4, 4, 4, 4]

我一定很愚蠢，因为错误可能是那些非常非常明显的事情之一。但我似乎找不到它。

还有第二个问题：我为三个数组 twoago、oneago 和 thisrow 分配了 malloc 空间，然后它们以循环方式交换。当我尝试free(twoago)等等时，我得到一个 glibc 抱怨双重释放或损坏的行。我用谷歌搜索了这个；难道是指针交换业务让glibc有点头晕，导致无法正确释放内存？

下面我首先列出了进行编译所需的 setup.py (/path/to/python3.1 ./setup.py build_ext --inplace)，然后编辑距离代码正确，所以感兴趣人们发现更容易复制。

还有一件事：这是用Python3.1运行的；一件有趣的事情是，在 *.pyx 文件中我们确实有裸露的 unicode 字符串，但 print 仍然是一个语句，而不是一个函数。

是的，我知道这里要粘贴很多代码，但问题是，当您把它削减太多了。我相信除 editdistance() 之外的所有方法都可以正常工作，但请随意指出您发现的任何问题。

setup.py：

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name            = 'cython_dameraulevenshtein',
  ext_modules     = [
    Extension( 'cython_dameraulevenshtein', [ 'cython_dameraulevenshtein.pyx', ] ), ],
  cmdclass        = {
    'build_ext': build_ext }, )

cython_dameraulevenshtein.pyx（一直滚动到最后查看有趣的东西）：

############################################################################################################
cdef extern from "stdlib.h":
  ctypedef  unsigned int size_t
  void      *malloc(size_t size)
  void      *realloc( void *ptr, size_t size )
  void      free(void *ptr)

#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_two_uints( unsigned int a, unsigned int b ):
  if a < b: return a
  return b

#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_three_uints( unsigned int a, unsigned int b, unsigned int c ):
  if a < b:
    if c < a:
      return c
    return a
  if c < b:
    return c
  return b

#-----------------------------------------------------------------------------------------------------------
cdef inline int _warp( unsigned int limit, int value ):
  return value if value >= 0 else limit + value

############################################################################################################
# ARRAYS THAT SAY SIZE ;-)
#-----------------------------------------------------------------------------------------------------------
cdef class Array_of_unsigned_int:
  cdef unsigned int *data
  cdef unsigned int length

  #---------------------------------------------------------------------------------------------------------
  def __cinit__( self, unsigned int length, fill_value = None ):
    self.length = length
    self.data   = <unsigned int *>malloc( length * sizeof( unsigned int ) )  ###OBS### must check malloc doesn't return NULL pointer
    if fill_value is not None:
      self.fill( fill_value )

  #---------------------------------------------------------------------------------------------------------
  cdef fill( self, unsigned int value ):
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      d[ idx ] = value

  #---------------------------------------------------------------------------------------------------------
  cdef resize( self, unsigned int length ):
    self.data   = <unsigned int *>realloc( self.data, length * sizeof( unsigned int ) )  ###OBS### must check realloc doesn't return NULL pointer
    self.length = length

  #---------------------------------------------------------------------------------------------------------
  def free( self ):
    """Always remember the milk: Free up memory."""
    free( self.data )  ###OBS### should free memory here

  #---------------------------------------------------------------------------------------------------------
  def as_list( self ):
    """Return the array as a Python list."""
    R                       = []
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      R.append( d[ idx ] )
    return R


############################################################################################################
# CONVERTING UNICODE TO CHARACTER IDs (CIDs)
#---------------------------------------------------------------------------------------------------------
cdef unsigned int _UMX_surrogate_lower_bound    = 0x10000
cdef unsigned int _UMX_surrogate_upper_bound    = 0x10ffff
cdef unsigned int _UMX_surrogate_hi_lower_bound = 0xd800
cdef unsigned int _UMX_surrogate_hi_upper_bound = 0xdbff
cdef unsigned int _UMX_surrogate_lo_lower_bound = 0xdc00
cdef unsigned int _UMX_surrogate_lo_upper_bound = 0xdfff
cdef unsigned int _UMX_surrogate_foobar_factor  = 0x400

#---------------------------------------------------------------------------------------------------------
cdef Array_of_unsigned_int _cids_from_text( text ):
  """Givn a ``text`` either as a Unicode string or as a ``bytes`` or ``bytearray``, return an instance of
  ``Array_of_unsigned_int`` that enumerates either the Unicode codepoints of each character or the value of
  each byte. Surrogate pairs will be condensed into single values, so on narrow Python builds the length of
  the array returned may be less than ``len( text )``."""
  #.........................................................................................................
  # Make sure ``text`` is either a Unicode string (``str``) or a ``bytes``-like thing:
  is_bytes = isinstance( text, ( bytes, bytearray, ) )
  assert is_bytes or isinstance( text, str ), '#121'
  #.........................................................................................................
  # Whether it is a ``str`` or a ``bytes``, we know the result can only have at most as many elements as
  # there are characters in ``text``, so we can already reserve that much space (in the case of a Unicode
  # text, there may be fewer CIDs if there happen to be surrogate characters):
  cdef unsigned int           length  = <unsigned int>len( text )
  cdef Array_of_unsigned_int  R       = Array_of_unsigned_int( length )
  #.........................................................................................................
  # If ``text`` is empty, we can return an empty array right away:
  if length == 0: return R
  #.........................................................................................................
  # Otherwise, prepare to copy data:
  cdef unsigned int idx               = 0
  #.........................................................................................................
  # If ``text`` is a ``bytes``-like thing, use simplified processing; we just have to copy over all byte
  # values and are done:
  if is_bytes:
    for idx from 0 <= idx < length:
      R.data[ idx ] = <unsigned int>text[ idx ]
    return R
  #.........................................................................................................
  cdef unsigned int cid               = 0
  cdef bool         is_surrogate      = False
  cdef unsigned int hi                = 0
  cdef unsigned int lo                = 0
  cdef unsigned int chr_count         = 0
  #.........................................................................................................
  # Iterate over all indexes in text:
  for idx from 0 <= idx < length:
    #.......................................................................................................
    # If we met with a surrogate CID in the last cycle, then that was a high surrogate CID, and the
    # corresponding low CID is on the current position. Having both, we can compute the intended CID
    # and reset the flag:
    if is_surrogate:
      lo = <unsigned int>ord( text[ idx ] )
      # IIRC, this formula was documented in Unicode 3:
      cid = ( ( hi - _UMX_surrogate_hi_lower_bound ) * _UMX_surrogate_foobar_factor
            + ( lo - _UMX_surrogate_lo_lower_bound ) + _UMX_surrogate_lower_bound )
      is_surrogate = False
    #.......................................................................................................
    else:
      # Otherwise, we retrieve the CID from the current position:
      cid = <unsigned int>ord( text[ idx ] )
      #.....................................................................................................
      if _UMX_surrogate_hi_lower_bound <= cid <= _UMX_surrogate_hi_upper_bound:
        # If this CID is a high surrogate CID, set ``hi`` to this value and set a flag so we'll come back
        # in the next cycle:
        hi                = cid
        is_surrogate      = True
        continue
    #.......................................................................................................
    R.data[ chr_count ] = cid
    chr_count     += 1
  #.........................................................................................................
  # Surrogate CIDs take up two characters but end up as a single resultant CID, so the return value may
  # have fewer elements than the naive string length indicated; in this case, we want to free some memory
  # and correct array length data:
  if chr_count != length:
    R.resize( chr_count )
  #.........................................................................................................
  return R

#---------------------------------------------------------------------------------------------------------
def cids_from_text( text ):
  cdef Array_of_unsigned_int c_R  =_cids_from_text( text )
  R                               = c_R.as_list()
  c_R.free() ###OBS### should free memory here
  return R


############################################################################################################
# SECOND-ORDER SIMILARITY
#-----------------------------------------------------------------------------------------------------------
cpdef float similarity( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their Damerau-Levenshtein similarity as a float between
  0.0 and 1.1. Similarity is computed as ``1 - relative_editdistance( a, b )``, so a result of ``1.0``
  indicates identity, while ``0.0`` indicates complete dissimilarity."""
  return 1.0 - relative_editdistance( a, b )

#-----------------------------------------------------------------------------------------------------------
cpdef float relative_editdistance( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their relative Damerau-Levenshtein distance. The return
  value is a float between 0.0 and 1.0; it is calculated as the absolute edit distance, divided by the
  length of the longer string. Therefore, ``0.0`` indicates identity, while ``1.0`` indicates complete
  dissimilarity."""
  cdef int length = max( len( a ), len( b ) )
  if length == 0: return 0.0
  return editdistance( a, b ) / <float>length

############################################################################################################
# EDIT DISTANCE
#-----------------------------------------------------------------------------------------------------------
cpdef unsigned int editdistance( text_a, text_b ):
  """Given texts as Unicode strings or ``bytes`` / ``bytearray`` objects, return their absolute
  Damerau-Levenshtein distance. Each deletion, insertion, substitution, and transposition is counted as one
  difference, so the edit distance between ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is
  ``1``."""
  #.........................................................................................................
  # This should be fast in Python, as it can (and probably is) implemented by doing an identity check in
  # the case of ``bytes`` and ``str`` objects:
  if text_a == text_b: return 0
  #.........................................................................................................
  # Convert Unicode text to C array of unsigned integers:
  cdef Array_of_unsigned_int a  = _cids_from_text( text_a )
  cdef Array_of_unsigned_int b  = _cids_from_text( text_b )
  R                             = c_editdistance( a, b )
  #.........................................................................................................
  # Always remember the milk:
  a.free()
  b.free()
  #.........................................................................................................
  return R

#-----------------------------------------------------------------------------------------------------------
cdef unsigned int c_editdistance( Array_of_unsigned_int cids_a, Array_of_unsigned_int cids_b ):
  # Conceptually, this is based on a len(a) + 1 * len(b) + 1 matrix.
  # However, only the current and two previous rows are needed at once,
  # so we only store those.
  #.........................................................................................................
  # This shortcut is pretty useless if comparison is not very fast; therefore, it is done in the function
  # that deals with the Python objects, q.v.
  # if cids_a.equals( cids_b ): return 0
  #.........................................................................................................
  cdef unsigned int a_length            = cids_a.length
  cdef unsigned int b_length            = cids_b.length
  #.........................................................................................................
  # Another shortcut: if one of the texts is empty, then the edit distance is trivially the length of the
  # other text. This also works for two empty texts, but those have already been taken care of by the
  # previous shortcut:
  #.........................................................................................................
  if a_length == 0: return b_length
  if b_length == 0: return a_length
  #.........................................................................................................
  cdef unsigned int row_length          = b_length   + 1
  cdef unsigned int row_length_1        = row_length - 1
  cdef unsigned int row_bytecount       = sizeof( unsigned int ) * row_length
  cdef unsigned int *oneago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int *twoago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int *thisrow            = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int idx                 = 0
  cdef unsigned int idx_a               = 0
  cdef unsigned int idx_b               = 0
  cdef          int idx_a_1_text        = 0
  cdef          int idx_b_1_row         = 0
  cdef          int idx_b_2_row         = 0
  cdef          int idx_b_1_text        = 0
  cdef unsigned int deletion_cost       = 0
  cdef unsigned int addition_cost       = 0
  cdef unsigned int substitution_cost   = 0
  #.........................................................................................................
  # Equivalent of ``thisrow = list( range( 1, b_length + 1 ) ) + [ 0 ]``:
  #print( '#305', cids_a.as_list(), cids_b.as_list(), a_length, b_length, row_length, row_length_1 )
  for idx from 1 <= idx < row_length:
    thisrow[ idx - 1 ] = idx
  thisrow[ row_length - 1 ] = 0
  #.........................................................................................................
  for idx_a from 0 <= idx_a < a_length:
    idx_a_1_text      = _warp(   a_length, idx_a - 1 )
    twoago, oneago = oneago, thisrow
    #.......................................................................................................
    # Equivalent of ``thisrow = [ 0 ] * b_length + [ idx_a + 1 ]``:
    for idx from 0 <= idx < row_length_1:
      thisrow[ idx ] = 0
    thisrow[ row_length - 1 ] = idx_a + 1
    #.......................................................................................................
    # some diagnostic output:
    x = []
    for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
    print
    print '#ED  A', x
    #.......................................................................................................
    for idx_b from 0 <= idx_b < b_length:
      #.....................................................................................................
      idx_b_1_row       = _warp( row_length, idx_b - 1 )
      idx_b_1_text      = _warp(   b_length, idx_b - 1 )
      #.....................................................................................................
      assert 0 <= idx_b_1_row  < row_length, ( '#323', idx_b_1_row, )
      assert 0 <= idx_a_1_text <   a_length, ( '#324', idx_a_1_text, )
      assert 0 <= idx_b_1_text <   b_length, ( '#325', idx_b_1_text, )
      #.....................................................................................................
      deletion_cost     = oneago[  idx_b       ] + 1
      addition_cost     = thisrow[ idx_b_1_row ] + 1
      substitution_cost = oneago[  idx_b_1_row ] + ( 1 if    cids_a.data[ idx_a ]
                                                          != cids_b.data[ idx_b ] else 0 )
      thisrow[ idx_b ]  = _minimum_of_three_uints( deletion_cost, addition_cost, substitution_cost )
      #.....................................................................................................
      # Transpositions:
      if (  idx_a > 0
        and idx_b > 0
        and cids_a.data[ idx_a        ] == cids_b.data[ idx_b_1_text ]
        and cids_a.data[ idx_a_1_text ] == cids_b.data[ idx_b        ]
        and cids_a.data[ idx_a        ] != cids_b.data[ idx_b        ] ):
        #...................................................................................................
        idx_b_2_row       = _warp( row_length, idx_b - 2 )
        assert 0 <= idx_b_2_row  < row_length, ( '#340', idx_b_2_row, )
        thisrow[ idx_b ]  = _minimum_of_two_uints( thisrow[ idx_b ], twoago[ idx_b_2_row ] + 1 )
      #.....................................................................................................
      # some diagnostic output:
      x = []
      for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
      print '#ED  B', x
  #.........................................................................................................
  # Here, ``b_length - 1`` can't become negative, since we already tested for ``b_length == 0`` in the
  # shortcut above:
  cdef unsigned int R = thisrow[ b_length - 1 ]
  #.........................................................................................................
  # Always remember the milk:
  # BUG: Activating below lines leads to glibc failing with ``double free or corruption``
  #free( twoago )
  #free( oneago )
  #free( thisrow )e
  #.........................................................................................................
  return R

#-----------------------------------------------------------------------------------------------------------
def editdistance_reference( text_a, text_b ):
  """This method is believed to compute a correct Damerau-Levenshtein edit distance, with deletions,
  insertions, substitutions, and transpositions. Do not touch it; it is here to validate results returned
  from the above method. Code adapted from
  http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance"""
  # Conceptually, the implementation is based on a ``( len( seq1 ) + 1 ) * ( len( seq2 ) + 1 )`` matrix.
  # However, only the current and two previous rows are needed at once, so we only store those. Python
  # lists wrap around for negative indices, so we put the leftmost column at the *end* of the list. This
  # matches with the zero-indexed strings and saves extra calculation.
  b_length  = len( text_b )
  oneago    = None
  thisrow   = list( range( 1, b_length + 1 ) ) + [ 0 ]
  for idx_a in range( len( text_a ) ):
    twoago, oneago, thisrow = oneago, thisrow, [ 0 ] * b_length + [ idx_a + 1 ]
    #.......................................................................................................
    # some diagnostic output:
    print
    print '#EDR A', thisrow
    #.......................................................................................................
    for idx_b in range( b_length ):
      deletion_cost     = oneago[  idx_b     ] + 1
      addition_cost     = thisrow[ idx_b - 1 ] + 1
      substitution_cost = oneago[  idx_b - 1 ] + ( text_a[ idx_a ] != text_b[ idx_b ] )
      thisrow[ idx_b ]  = min( deletion_cost, addition_cost, substitution_cost )
      if (  idx_a > 0
        and idx_b > 0
        and text_a[ idx_a     ] == text_b[ idx_b - 1 ]
        and text_a[ idx_a - 1 ] == text_b[ idx_b     ]
        and text_a[ idx_a     ] != text_b[ idx_b     ] ):
        thisrow[ idx_b ] = min( thisrow[ idx_b ], twoago[ idx_b - 2 ] + 1 )
      #.....................................................................................................
      # some diagnostic output:
      print '#EDR B', thisrow
      #.....................................................................................................
  return thisrow[ len( text_b ) - 1 ]

编辑我也将此文本发布到pastebin 和 Cython 列表。

原文

I'm back with another longish question. Having experimented with a number of Python-based Damerau-Levenshtein
edit distance implementations, I finally found the one listed below as editdistance_reference(). It
seems to deliver correct results and appears to have an efficient implementation.

So I set down to convert the code to Cython. on my test data, the reference method manages to deliver results
for 11,000 comparisons (for pairs of words aound 12 letters long), while the Cythonized method does over
200,000 comparisons per second. Sadly, the results are incorrect: when you look at the variable thisrow
which I print out for debugging, my version has it filled with ones no matter what data I throw at it,
while the reference output shows another picture. For example, testing 'helo' against 'world'
produces the following output (ED marks my function, EDR is the correctly working reference):

From editdistance():

#ED  A [0, 0, 0, 0, 0, 1]
#ED  B [1, 0, 0, 0, 0, 1]
#ED  B [1, 1, 0, 0, 0, 1]
#ED  B [1, 1, 1, 0, 0, 1]
#ED  B [1, 1, 1, 1, 0, 1]
#ED  B [1, 1, 1, 1, 1, 1]

#ED  A [0, 0, 0, 0, 0, 2]
#ED  B [1, 0, 0, 0, 0, 2]
#ED  B [1, 1, 0, 0, 0, 2]
#ED  B [1, 1, 1, 0, 0, 2]
#ED  B [1, 1, 1, 1, 0, 2]
#ED  B [1, 1, 1, 1, 1, 2]

#ED  A [0, 0, 0, 0, 0, 3]
#ED  B [1, 0, 0, 0, 0, 3]
#ED  B [1, 1, 0, 0, 0, 3]
#ED  B [1, 1, 1, 0, 0, 3]
#ED  B [1, 1, 1, 1, 0, 3]
#ED  B [1, 1, 1, 1, 1, 3]

#ED  A [0, 0, 0, 0, 0, 4]
#ED  B [1, 0, 0, 0, 0, 4]
#ED  B [1, 1, 0, 0, 0, 4]
#ED  B [1, 1, 1, 0, 0, 4]
#ED  B [1, 1, 1, 1, 0, 4]
#ED  B [1, 1, 1, 1, 1, 4]

from editdistance_reference():

#EDR A [0, 0, 0, 0, 0, 1]
#EDR B [1, 0, 0, 0, 0, 1]
#EDR B [1, 2, 0, 0, 0, 1]
#EDR B [1, 2, 3, 0, 0, 1]
#EDR B [1, 2, 3, 4, 0, 1]
#EDR B [1, 2, 3, 4, 5, 1]

#EDR A [0, 0, 0, 0, 0, 2]
#EDR B [2, 0, 0, 0, 0, 2]
#EDR B [2, 2, 0, 0, 0, 2]
#EDR B [2, 2, 3, 0, 0, 2]
#EDR B [2, 2, 3, 4, 0, 2]
#EDR B [2, 2, 3, 4, 5, 2]

#EDR A [0, 0, 0, 0, 0, 3]
#EDR B [3, 0, 0, 0, 0, 3]
#EDR B [3, 3, 0, 0, 0, 3]
#EDR B [3, 3, 3, 0, 0, 3]
#EDR B [3, 3, 3, 3, 0, 3]
#EDR B [3, 3, 3, 3, 4, 3]

#EDR A [0, 0, 0, 0, 0, 4]
#EDR B [4, 0, 0, 0, 0, 4]
#EDR B [4, 4, 0, 0, 0, 4]
#EDR B [4, 4, 4, 0, 0, 4]
#EDR B [4, 4, 4, 4, 0, 4]
#EDR B [4, 4, 4, 4, 4, 4]

i must be very dumb as the error is probably one of those very very obvious things. but i can't seem to find it.

there is a second problem: i malloc space for the three arrays twoago, oneago, and thisrow,
then they get swapped around in a circular fashion. when i try to free( twoago ) and so on, i get a
line where glibc complains about double free or corruption. i googled for that; could it be that the
pointer-swapping business makes glibc a bit dizzy so it becomes unable to correctly free memory?

below i list first the setup.py that is needed to do the compilation
(/path/to/python3.1 ./setup.py build_ext --inplace), then the edit distance code proper, so interested
people find it easier to replicate.

one more thing: this is run with Python3.1; one funny thing is that inside the *.pyx file we do have
bare unicode strings, but print is still a statement, not a function.

and yes i know this is a lot of code to paste here but the thing is the code is simply not runnable when you
cut it down all too much. i believe all methods except editdistance() to work correctly, but feel free
to point out any problems you perceive.

setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name            = 'cython_dameraulevenshtein',
  ext_modules     = [
    Extension( 'cython_dameraulevenshtein', [ 'cython_dameraulevenshtein.pyx', ] ), ],
  cmdclass        = {
    'build_ext': build_ext }, )

cython_dameraulevenshtein.pyx (scroll all the way to the end to see the interesting stuff):

############################################################################################################
cdef extern from "stdlib.h":
  ctypedef  unsigned int size_t
  void      *malloc(size_t size)
  void      *realloc( void *ptr, size_t size )
  void      free(void *ptr)

#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_two_uints( unsigned int a, unsigned int b ):
  if a < b: return a
  return b

#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_three_uints( unsigned int a, unsigned int b, unsigned int c ):
  if a < b:
    if c < a:
      return c
    return a
  if c < b:
    return c
  return b

#-----------------------------------------------------------------------------------------------------------
cdef inline int _warp( unsigned int limit, int value ):
  return value if value >= 0 else limit + value

############################################################################################################
# ARRAYS THAT SAY SIZE ;-)
#-----------------------------------------------------------------------------------------------------------
cdef class Array_of_unsigned_int:
  cdef unsigned int *data
  cdef unsigned int length

  #---------------------------------------------------------------------------------------------------------
  def __cinit__( self, unsigned int length, fill_value = None ):
    self.length = length
    self.data   = <unsigned int *>malloc( length * sizeof( unsigned int ) )  ###OBS### must check malloc doesn't return NULL pointer
    if fill_value is not None:
      self.fill( fill_value )

  #---------------------------------------------------------------------------------------------------------
  cdef fill( self, unsigned int value ):
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      d[ idx ] = value

  #---------------------------------------------------------------------------------------------------------
  cdef resize( self, unsigned int length ):
    self.data   = <unsigned int *>realloc( self.data, length * sizeof( unsigned int ) )  ###OBS### must check realloc doesn't return NULL pointer
    self.length = length

  #---------------------------------------------------------------------------------------------------------
  def free( self ):
    """Always remember the milk: Free up memory."""
    free( self.data )  ###OBS### should free memory here

  #---------------------------------------------------------------------------------------------------------
  def as_list( self ):
    """Return the array as a Python list."""
    R                       = []
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      R.append( d[ idx ] )
    return R


############################################################################################################
# CONVERTING UNICODE TO CHARACTER IDs (CIDs)
#---------------------------------------------------------------------------------------------------------
cdef unsigned int _UMX_surrogate_lower_bound    = 0x10000
cdef unsigned int _UMX_surrogate_upper_bound    = 0x10ffff
cdef unsigned int _UMX_surrogate_hi_lower_bound = 0xd800
cdef unsigned int _UMX_surrogate_hi_upper_bound = 0xdbff
cdef unsigned int _UMX_surrogate_lo_lower_bound = 0xdc00
cdef unsigned int _UMX_surrogate_lo_upper_bound = 0xdfff
cdef unsigned int _UMX_surrogate_foobar_factor  = 0x400

#---------------------------------------------------------------------------------------------------------
cdef Array_of_unsigned_int _cids_from_text( text ):
  """Givn a ``text`` either as a Unicode string or as a ``bytes`` or ``bytearray``, return an instance of
  ``Array_of_unsigned_int`` that enumerates either the Unicode codepoints of each character or the value of
  each byte. Surrogate pairs will be condensed into single values, so on narrow Python builds the length of
  the array returned may be less than ``len( text )``."""
  #.........................................................................................................
  # Make sure ``text`` is either a Unicode string (``str``) or a ``bytes``-like thing:
  is_bytes = isinstance( text, ( bytes, bytearray, ) )
  assert is_bytes or isinstance( text, str ), '#121'
  #.........................................................................................................
  # Whether it is a ``str`` or a ``bytes``, we know the result can only have at most as many elements as
  # there are characters in ``text``, so we can already reserve that much space (in the case of a Unicode
  # text, there may be fewer CIDs if there happen to be surrogate characters):
  cdef unsigned int           length  = <unsigned int>len( text )
  cdef Array_of_unsigned_int  R       = Array_of_unsigned_int( length )
  #.........................................................................................................
  # If ``text`` is empty, we can return an empty array right away:
  if length == 0: return R
  #.........................................................................................................
  # Otherwise, prepare to copy data:
  cdef unsigned int idx               = 0
  #.........................................................................................................
  # If ``text`` is a ``bytes``-like thing, use simplified processing; we just have to copy over all byte
  # values and are done:
  if is_bytes:
    for idx from 0 <= idx < length:
      R.data[ idx ] = <unsigned int>text[ idx ]
    return R
  #.........................................................................................................
  cdef unsigned int cid               = 0
  cdef bool         is_surrogate      = False
  cdef unsigned int hi                = 0
  cdef unsigned int lo                = 0
  cdef unsigned int chr_count         = 0
  #.........................................................................................................
  # Iterate over all indexes in text:
  for idx from 0 <= idx < length:
    #.......................................................................................................
    # If we met with a surrogate CID in the last cycle, then that was a high surrogate CID, and the
    # corresponding low CID is on the current position. Having both, we can compute the intended CID
    # and reset the flag:
    if is_surrogate:
      lo = <unsigned int>ord( text[ idx ] )
      # IIRC, this formula was documented in Unicode 3:
      cid = ( ( hi - _UMX_surrogate_hi_lower_bound ) * _UMX_surrogate_foobar_factor
            + ( lo - _UMX_surrogate_lo_lower_bound ) + _UMX_surrogate_lower_bound )
      is_surrogate = False
    #.......................................................................................................
    else:
      # Otherwise, we retrieve the CID from the current position:
      cid = <unsigned int>ord( text[ idx ] )
      #.....................................................................................................
      if _UMX_surrogate_hi_lower_bound <= cid <= _UMX_surrogate_hi_upper_bound:
        # If this CID is a high surrogate CID, set ``hi`` to this value and set a flag so we'll come back
        # in the next cycle:
        hi                = cid
        is_surrogate      = True
        continue
    #.......................................................................................................
    R.data[ chr_count ] = cid
    chr_count     += 1
  #.........................................................................................................
  # Surrogate CIDs take up two characters but end up as a single resultant CID, so the return value may
  # have fewer elements than the naive string length indicated; in this case, we want to free some memory
  # and correct array length data:
  if chr_count != length:
    R.resize( chr_count )
  #.........................................................................................................
  return R

#---------------------------------------------------------------------------------------------------------
def cids_from_text( text ):
  cdef Array_of_unsigned_int c_R  =_cids_from_text( text )
  R                               = c_R.as_list()
  c_R.free() ###OBS### should free memory here
  return R


############################################################################################################
# SECOND-ORDER SIMILARITY
#-----------------------------------------------------------------------------------------------------------
cpdef float similarity( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their Damerau-Levenshtein similarity as a float between
  0.0 and 1.1. Similarity is computed as ``1 - relative_editdistance( a, b )``, so a result of ``1.0``
  indicates identity, while ``0.0`` indicates complete dissimilarity."""
  return 1.0 - relative_editdistance( a, b )

#-----------------------------------------------------------------------------------------------------------
cpdef float relative_editdistance( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their relative Damerau-Levenshtein distance. The return
  value is a float between 0.0 and 1.0; it is calculated as the absolute edit distance, divided by the
  length of the longer string. Therefore, ``0.0`` indicates identity, while ``1.0`` indicates complete
  dissimilarity."""
  cdef int length = max( len( a ), len( b ) )
  if length == 0: return 0.0
  return editdistance( a, b ) / <float>length

############################################################################################################
# EDIT DISTANCE
#-----------------------------------------------------------------------------------------------------------
cpdef unsigned int editdistance( text_a, text_b ):
  """Given texts as Unicode strings or ``bytes`` / ``bytearray`` objects, return their absolute
  Damerau-Levenshtein distance. Each deletion, insertion, substitution, and transposition is counted as one
  difference, so the edit distance between ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is
  ``1``."""
  #.........................................................................................................
  # This should be fast in Python, as it can (and probably is) implemented by doing an identity check in
  # the case of ``bytes`` and ``str`` objects:
  if text_a == text_b: return 0
  #.........................................................................................................
  # Convert Unicode text to C array of unsigned integers:
  cdef Array_of_unsigned_int a  = _cids_from_text( text_a )
  cdef Array_of_unsigned_int b  = _cids_from_text( text_b )
  R                             = c_editdistance( a, b )
  #.........................................................................................................
  # Always remember the milk:
  a.free()
  b.free()
  #.........................................................................................................
  return R

#-----------------------------------------------------------------------------------------------------------
cdef unsigned int c_editdistance( Array_of_unsigned_int cids_a, Array_of_unsigned_int cids_b ):
  # Conceptually, this is based on a len(a) + 1 * len(b) + 1 matrix.
  # However, only the current and two previous rows are needed at once,
  # so we only store those.
  #.........................................................................................................
  # This shortcut is pretty useless if comparison is not very fast; therefore, it is done in the function
  # that deals with the Python objects, q.v.
  # if cids_a.equals( cids_b ): return 0
  #.........................................................................................................
  cdef unsigned int a_length            = cids_a.length
  cdef unsigned int b_length            = cids_b.length
  #.........................................................................................................
  # Another shortcut: if one of the texts is empty, then the edit distance is trivially the length of the
  # other text. This also works for two empty texts, but those have already been taken care of by the
  # previous shortcut:
  #.........................................................................................................
  if a_length == 0: return b_length
  if b_length == 0: return a_length
  #.........................................................................................................
  cdef unsigned int row_length          = b_length   + 1
  cdef unsigned int row_length_1        = row_length - 1
  cdef unsigned int row_bytecount       = sizeof( unsigned int ) * row_length
  cdef unsigned int *oneago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int *twoago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int *thisrow            = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int idx                 = 0
  cdef unsigned int idx_a               = 0
  cdef unsigned int idx_b               = 0
  cdef          int idx_a_1_text        = 0
  cdef          int idx_b_1_row         = 0
  cdef          int idx_b_2_row         = 0
  cdef          int idx_b_1_text        = 0
  cdef unsigned int deletion_cost       = 0
  cdef unsigned int addition_cost       = 0
  cdef unsigned int substitution_cost   = 0
  #.........................................................................................................
  # Equivalent of ``thisrow = list( range( 1, b_length + 1 ) ) + [ 0 ]``:
  #print( '#305', cids_a.as_list(), cids_b.as_list(), a_length, b_length, row_length, row_length_1 )
  for idx from 1 <= idx < row_length:
    thisrow[ idx - 1 ] = idx
  thisrow[ row_length - 1 ] = 0
  #.........................................................................................................
  for idx_a from 0 <= idx_a < a_length:
    idx_a_1_text      = _warp(   a_length, idx_a - 1 )
    twoago, oneago = oneago, thisrow
    #.......................................................................................................
    # Equivalent of ``thisrow = [ 0 ] * b_length + [ idx_a + 1 ]``:
    for idx from 0 <= idx < row_length_1:
      thisrow[ idx ] = 0
    thisrow[ row_length - 1 ] = idx_a + 1
    #.......................................................................................................
    # some diagnostic output:
    x = []
    for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
    print
    print '#ED  A', x
    #.......................................................................................................
    for idx_b from 0 <= idx_b < b_length:
      #.....................................................................................................
      idx_b_1_row       = _warp( row_length, idx_b - 1 )
      idx_b_1_text      = _warp(   b_length, idx_b - 1 )
      #.....................................................................................................
      assert 0 <= idx_b_1_row  < row_length, ( '#323', idx_b_1_row, )
      assert 0 <= idx_a_1_text <   a_length, ( '#324', idx_a_1_text, )
      assert 0 <= idx_b_1_text <   b_length, ( '#325', idx_b_1_text, )
      #.....................................................................................................
      deletion_cost     = oneago[  idx_b       ] + 1
      addition_cost     = thisrow[ idx_b_1_row ] + 1
      substitution_cost = oneago[  idx_b_1_row ] + ( 1 if    cids_a.data[ idx_a ]
                                                          != cids_b.data[ idx_b ] else 0 )
      thisrow[ idx_b ]  = _minimum_of_three_uints( deletion_cost, addition_cost, substitution_cost )
      #.....................................................................................................
      # Transpositions:
      if (  idx_a > 0
        and idx_b > 0
        and cids_a.data[ idx_a        ] == cids_b.data[ idx_b_1_text ]
        and cids_a.data[ idx_a_1_text ] == cids_b.data[ idx_b        ]
        and cids_a.data[ idx_a        ] != cids_b.data[ idx_b        ] ):
        #...................................................................................................
        idx_b_2_row       = _warp( row_length, idx_b - 2 )
        assert 0 <= idx_b_2_row  < row_length, ( '#340', idx_b_2_row, )
        thisrow[ idx_b ]  = _minimum_of_two_uints( thisrow[ idx_b ], twoago[ idx_b_2_row ] + 1 )
      #.....................................................................................................
      # some diagnostic output:
      x = []
      for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
      print '#ED  B', x
  #.........................................................................................................
  # Here, ``b_length - 1`` can't become negative, since we already tested for ``b_length == 0`` in the
  # shortcut above:
  cdef unsigned int R = thisrow[ b_length - 1 ]
  #.........................................................................................................
  # Always remember the milk:
  # BUG: Activating below lines leads to glibc failing with ``double free or corruption``
  #free( twoago )
  #free( oneago )
  #free( thisrow )e
  #.........................................................................................................
  return R

#-----------------------------------------------------------------------------------------------------------
def editdistance_reference( text_a, text_b ):
  """This method is believed to compute a correct Damerau-Levenshtein edit distance, with deletions,
  insertions, substitutions, and transpositions. Do not touch it; it is here to validate results returned
  from the above method. Code adapted from
  http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance"""
  # Conceptually, the implementation is based on a ``( len( seq1 ) + 1 ) * ( len( seq2 ) + 1 )`` matrix.
  # However, only the current and two previous rows are needed at once, so we only store those. Python
  # lists wrap around for negative indices, so we put the leftmost column at the *end* of the list. This
  # matches with the zero-indexed strings and saves extra calculation.
  b_length  = len( text_b )
  oneago    = None
  thisrow   = list( range( 1, b_length + 1 ) ) + [ 0 ]
  for idx_a in range( len( text_a ) ):
    twoago, oneago, thisrow = oneago, thisrow, [ 0 ] * b_length + [ idx_a + 1 ]
    #.......................................................................................................
    # some diagnostic output:
    print
    print '#EDR A', thisrow
    #.......................................................................................................
    for idx_b in range( b_length ):
      deletion_cost     = oneago[  idx_b     ] + 1
      addition_cost     = thisrow[ idx_b - 1 ] + 1
      substitution_cost = oneago[  idx_b - 1 ] + ( text_a[ idx_a ] != text_b[ idx_b ] )
      thisrow[ idx_b ]  = min( deletion_cost, addition_cost, substitution_cost )
      if (  idx_a > 0
        and idx_b > 0
        and text_a[ idx_a     ] == text_b[ idx_b - 1 ]
        and text_a[ idx_a - 1 ] == text_b[ idx_b     ]
        and text_a[ idx_a     ] != text_b[ idx_b     ] ):
        thisrow[ idx_b ] = min( thisrow[ idx_b ], twoago[ idx_b - 2 ] + 1 )
      #.....................................................................................................
      # some diagnostic output:
      print '#EDR B', thisrow
      #.....................................................................................................
  return thisrow[ len( text_b ) - 1 ]

edit i also posted this text to pastebin and the Cython list.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

此生挚爱伱 2024-09-20 11:37:41

进行一些基本的调试。您知道标记为 #ED B 的第二个输出行出了问题。错误的值似乎表明它很早就找到了一个编辑，并且再也没有找到过。这可能是因为 min() 参数之一以某种方式被限制在 1。打印 deletion_cost、substitution_cost、addition_cost ...哪个是错误的？为什么错了？打印输入文本值。暂时禁用转置部分，看看是否可以解决问题。检查并重新检查_warp caper（如果我见过的话，这是一个狡猾的霍比特人的花招）及其用法。如果将“aaaaa”与“aaaaa”进行比较会发生什么？ “qwerty”与“qwerty”？ “xxxxx”与“yyyyy”？所有 bytes、bytearray 和 str 输入是否都会出现此问题？

免费的问题是：我怀疑腐败，而不是头晕。打印三个数组；它们的内容是否符合预期？尝试一次启用一个数组的 free() —— 全部损坏了吗？只有一个？哪一个？

关于内存管理的一些旁白：您可能想阅读此并考虑使用特定于 Python 的例程而不是 malloc/free。如果有代理，则缩小阵列规模似乎有些过头了。

更新：遵循我自己的建议。删除成本被塞满。 “oneago”与“thisrow”相同。导致错误答案和双倍（-！未损坏！-）释放的问题：指针的循环洗牌不是循环的。

# twoago, oneago = oneago, thisrow ### BUG ###
twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###

更新2： [评论容量太小]没有魔力，只是普通的调试工作，正如我所建议的。 “专注于我的修复”并不是“超级可读”。参考代码确实为每次传递创建了一个新列表，它可以这样做，因为 thisrow 没有引用上一次传递的任何内容。它不需要这样做，事实上，除了第一个和最后一个元素之外的初始化可以由随机数组成，并且仅用于填写列表，以便可以将其索引到而不是附加到一些非- 棘手的实现确实如此。因此，您可以盲目地模拟“参考实现”，但代价是执行额外的（浪费的）malloc/free，或者您可以忽略特定于 Python 的实现细节，而仅使用参考实现作为可能正确答案的来源。然后您可以接受我的修复，然后通过删除 thisrow 数组的大部分初始化来继续节省时间。

更新 3： 这是为您提供的替代参考实现。它最初分配 3 行，以避免外循环内创建列表的开销。它还避免了对 thisrow 的最后一个元素以外的所有元素进行不必要的初始化。这简化了到 C/Cython 的转换。

def damlevref2(seq1, seq2):
    # For Python 2.x as was the original.
    # Appears to work on Python 1.5.2 as well :-)
    seq2len = len(seq2)
    twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do
    oneago = [-666] * (seq2len + 1) # ditto
    thisrow = range(1, seq2len + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle
        thisrow[-1] = x + 1
        for y in xrange(seq2len):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[seq2len - 1]

Do some elementary debugging. You know that it is going wrong in the 2nd output line marked #ED B. The wrong values seem to indicate that it finds one edit early on and never finds any more. This is possibly because one of the min() args is somehow clamped at 1. Print deletion_cost, substitution_cost, addition_cost ... which is wrong? Why is it wrong? Print the input text values. Temporarily disable the transposition section to see if that makes the problem go away. Check and re-check the _warp caper (a tricksy hobbit gimmick if I ever saw one) and the usage thereof. What happens if you compare "aaaaa" with "aaaaa"? "qwerty" with "qwerty"? "xxxxx" with "yyyyy"? Does the problem happen with all of bytes, bytearray and str input?

The free problem: I'd suspect corruption, not dizzyness. Print the three arrays; are their contents as expected? Try enabling the free() one array at a time -- all broken? only one? which one?

Some asides on memory management: You may like to read this and consider using the Python-specific routines instead of malloc/free. Downsizing your array if there have been surrogates seems over the top.

Update: Followed my own suggestions. Deletion cost was stuffed. "oneago" was same as "thisrow". Problem causing both the wrong answer and the doubled (-! not corrupted !-) free: circular shuffle of pointers wasn't circular.

# twoago, oneago = oneago, thisrow ### BUG ###
twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###

Update 2: [comment capacity too small] No mojo, just plain ordinary debugging spadework, as I suggested. "concentrating on this for my fix" is not "super-readible". The reference code does create a new list for each pass, which it CAN do because thisrow refers to nothing carried over from the previous pass. It doesn't NEED to do this, and in fact the initialisation apart from the first and last elements could consist of random numbers, and are only there to fill out the list so that it can be indexed into instead of appended to as some non-tricksy implementations do. So you can slavishly emulate the "reference implementation", at the cost of doing an extra (wasted) malloc/free, or you could ignore the Python-specific implementation details and use the reference implementation solely as a source of presumably correct answers. Then you could accept my fix, and later go on to saving time by chopping out most of the initialisation of the thisrow array.

Update 3: Here's a replacement reference implementation for you. It allocates 3 rows initially, in order to avoid the overhead of list creation inside the outer loop. It also avoids the unnecessary initialisation of all but the last element of thisrow. This eases the translation into C/Cython.

def damlevref2(seq1, seq2):
    # For Python 2.x as was the original.
    # Appears to work on Python 1.5.2 as well :-)
    seq2len = len(seq2)
    twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do
    oneago = [-666] * (seq2len + 1) # ditto
    thisrow = range(1, seq2len + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle
        thisrow[-1] = x + 1
        for y in xrange(seq2len):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[seq2len - 1]

回复收藏 0 原文

~没有更多了~