How to convert a python/cython unicode string into an array of long integers, for Levenshtein edit distance

Possible Duplicate:
How to correct bugs in this Damerau-Levenshtein implementation?

I have the following Cython code (adapted from the bpbio project) that does Damerau-Levenshtein edit-distance calculation:

#---------------------------------------------------------------------------
cdef extern from "stdlib.h":
  ctypedef unsigned int size_t
  size_t strlen(char *s)
  void *malloc(size_t size)
  void *calloc(size_t n, size_t size)
  void free(void *ptr)
  int strcmp(char *a, char *b)
  char * strcpy(char *a, char *b)

#---------------------------------------------------------------------------
cdef extern from "Python.h":
  object PyTuple_GET_ITEM(object, int)
  void Py_INCREF(object)

#---------------------------------------------------------------------------
cdef inline size_t imin(int a, int b, int c):
  if a < b:
    if c < a:
      return c
    return a
  if c < b:
    return c
  return b

#---------------------------------------------------------------------------
cpdef int editdistance( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their absolute Damerau-
  Levenshtein distance. Each deletion, insertion, substitution, and
  transposition is counted as one difference, so the edit distance between
  ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is ``1``."""

  #.........................................................................
  if strcmp( a, b ) == 0: return 0
  #.........................................................................
  cdef int    alen    = strlen( a )
  cdef int    blen    = strlen( b )
  cdef int    R
  cdef char   *ctmp
  cdef size_t i
  cdef size_t j
  cdef size_t achr
  cdef size_t bchr
  #.........................................................................
  if alen > blen:
    ctmp = a;
    a = b;
    b = ctmp;
    alen, blen = blen, alen
  #.........................................................................
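  # m1, m2 and m3 each hold one row of the distance matrix, one byte per cell: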
  cdef char   *m1     = <char *>calloc(   blen + 2,    sizeof( char ) )
  cdef char   *m2     = <char *>calloc(   blen + 2,    sizeof( char ) )
  cdef char   *m3     = <char *>malloc( ( blen + 2 ) * sizeof( char ) )
  #.........................................................................
  for i from 0 <= i <= blen:
    m2[ i ] = i
  #.........................................................................
  for i from 1 <= i <= alen:
    m1[ 0 ] =    i + 1
    achr    = a[ i - 1 ]
    for j from 1 <= j <= blen:
      bchr = b[ j - 1 ]
      if achr == bchr:
        m1[ j ] = m2[ j - 1 ]
      else:
        m1[ j ] = 1 + imin( m1[ j - 1 ], m2[ j - 1 ], m2[ j ] )
      if i != 1 and j != 1 and achr == b[ j - 2 ] and bchr == a[ i - 2 ]:
        m1[ j ] = m3[ j - 1 ]
    #.......................................................................
    m1, m2 = m2, m1
    strcpy( m3, m2 )
  #.........................................................................
  R = <int>m2[ blen ]
  #.........................................................................
  # cleanup:
  free( m3 )
  free( m1 )
  free( m2 )
  #.........................................................................
  return R

The code runs fine and fast (300,000...400,000 comparisons per second on my PC).

The challenge is to make this code work with unicode strings as well. I am running Python 3.1 and retrieve texts from a database that are then matched against a query text.

Encoding these strings to bytes before passing them to the Cython function for comparison would not be a good idea, since performance would suffer considerably (tested) and results would likely be wrong for any text containing characters outside of 7-bit US-ASCII.

The (very terse) Cython manual does mention unicode strings, but is hardly helpful for the problem at hand.

As I see it, a unicode string can be conceived of as an array of integers, each representing a single codepoint, and the code above basically operates on arrays of chars already, so my guess is that I should (1) extend it to handle C arrays of integers; (2) add code to convert a python unicode string to such a C array; (3) profit!

(Note: there are two potential issues with this approach. One is handling unicode surrogate characters, but I think I know what to do with those. The other is that unicode codepoints do not really map 1:1 to the concept of 'characters'; I am well aware of that, but I consider it outside the scope of this question. Please assume that one unicode codepoint is one unit of comparison.)

So I am asking for suggestions on how to:

  • write a fast Cython function that accepts a python unicode string and returns a C array of Cython unsigned ints (4 bytes);

  • modify the code shown to handle those arrays and do the correct memory allocations / deallocations (this is pretty foreign stuff to me); a rough, untested sketch of the conversion step follows below.
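
As a starting point, here is a rough, untested sketch of the conversion step I have in mind; the helper unicode_to_uints and the int-array port _editdistance_uints it hands off to are made-up names for illustration only:

#---------------------------------------------------------------------------
cdef extern from "stdlib.h":
  void *malloc( size_t size )
  void free( void *ptr )

#---------------------------------------------------------------------------
cdef unsigned int *unicode_to_uints( text, Py_ssize_t *length ) except NULL:
  """Copy the codepoints of a python unicode string into a freshly malloc'ed
  array of unsigned ints; the caller owns the buffer and must free() it."""
  cdef Py_ssize_t n = len( text )
  cdef Py_ssize_t i
  # n + 1 cells so that even an empty string yields a non-NULL buffer:
  cdef unsigned int *result = <unsigned int *>malloc( ( n + 1 ) * sizeof( unsigned int ) )
  if result == NULL:
    raise MemoryError()
  for i in range( n ):
    result[ i ] = ord( text[ i ] )    # one codepoint per array element
  length[ 0 ] = n
  return result

#---------------------------------------------------------------------------
cdef int _editdistance_uints( unsigned int *a, Py_ssize_t alen,
                              unsigned int *b, Py_ssize_t blen ):
  # placeholder: the row-by-row loops from editdistance() would go here,
  # comparing unsigned int cells instead of chars
  return 0

#---------------------------------------------------------------------------
cpdef int editdistance_unicode( a, b ):
  """Convert both strings, hand the buffers to the int-array port above, and
  release them afterwards."""
  cdef Py_ssize_t alen = 0
  cdef Py_ssize_t blen = 0
  cdef unsigned int *abuf = NULL
  cdef unsigned int *bbuf = NULL
  try:
    abuf = unicode_to_uints( a, &alen )
    bbuf = unicode_to_uints( b, &blen )
    return _editdistance_uints( abuf, alen, bbuf, blen )
  finally:
    free( abuf )    # free( NULL ) is a no-op
    free( bbuf )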

Edit: John Machin has pointed out that the curious typecasts (char *m1 etc.) are probably done for speed and/or memory optimization; these variables are still treated as arrays of numbers. I realize that the code does nothing to prevent possible overflow with long strings; erroneous results may occur when an array element exceeds 127 or 255 (depending on the C compiler used). Sort of surprising for code coming from a bioinformatics project.

That said, I am only interested in precise results for largely identical strings of less than, say, a hundred characters or so. Results below 60% sameness could, for my purposes, safely be reported as 'completely different' (by returning the length of the longer text), so I guess it is best to leave the char *m1 casts in place, but to add some code that checks for overflow and aborts early in case of rampant dissimilarity.
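
To make that cutoff concrete, here is a slow, Python-level sketch of the early-abort idea (plain Levenshtein without transpositions, untested; 'sameness' is read here as 1 - distance / length of the longer string):

def bounded_editdistance( a, b, min_sameness = 0.60 ):
  """Row-by-row Levenshtein with a cutoff: the minimum of the current row is a
  lower bound on the final distance, so once it exceeds the edit budget the
  strings can never reach ``min_sameness`` and the longer length is returned."""
  if a == b: return 0
  if len( a ) > len( b ): a, b = b, a
  alen, blen = len( a ), len( b )
  budget     = int( blen * ( 1.0 - min_sameness ) )   # edits still acceptable
  previous   = list( range( blen + 1 ) )              # row 0
  for i in range( 1, alen + 1 ):
    current = [ i ] + [ 0 ] * blen
    for j in range( 1, blen + 1 ):
      cost         = 0 if a[ i - 1 ] == b[ j - 1 ] else 1
      current[ j ] = min( current[ j - 1 ] + 1,       # insertion
                          previous[ j ] + 1,          # deletion
                          previous[ j - 1 ] + cost )  # substitution / match
    if min( current ) > budget:
      return blen                                     # 'completely different'
    previous = current
  return previous[ blen ]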

Comments (3)

ゃ人海孤独症 2024-09-19 18:42:36

Use ord() to convert characters to their integer code point. It works on characters from either the unicode or str string types:

codepoints = [ord(c) for c in text]

朱染 2024-09-19 18:42:36

Caveat lector: I've never done this. The following is a rough sketch of what I'd try.

You will need to use the PyUnicode_AsUnicode function and the next one, PyUnicode_GetSize. In declarations, where you currently have char, use Py_UNICODE instead. Presumably with a narrow (UCS2) build you will copy the internal structure, converting surrogate pairs as you go. With a wide (UCS4) build you might operate directly on the internal structure.
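
A rough, untested Cython rendering of that suggestion might look like the following (PyUnicode_AsUnicode and PyUnicode_GetSize belong to the Py_UNICODE C API of the Python 3.1 era and were removed in later Python versions; editdistance_py_unicode is just an illustrative name):

#---------------------------------------------------------------------------
cdef extern from "Python.h":
  Py_UNICODE *PyUnicode_AsUnicode( object text ) except NULL
  Py_ssize_t  PyUnicode_GetSize( object text ) except -1

#---------------------------------------------------------------------------
cpdef int editdistance_py_unicode( a, b ):
  # borrow pointers into the strings' internal code-unit buffers (no copy,
  # nothing to free; Py_UNICODE is a built-in Cython type)
  cdef Py_UNICODE *abuf = PyUnicode_AsUnicode( a )
  cdef Py_UNICODE *bbuf = PyUnicode_AsUnicode( b )
  cdef Py_ssize_t alen  = PyUnicode_GetSize( a )
  cdef Py_ssize_t blen  = PyUnicode_GetSize( b )
  # the row-by-row loops from the question would go here, comparing Py_UNICODE
  # code units and keeping the row buffers as int (not char) arrays
  return 0    # placeholder result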

寂寞花火° 2024-09-19 18:42:36

I am closing this question because I have found a better algorithm... with its own problems. See you over there.
