findSubstrings and breakSubstring in Data.ByteString

Posted on 2024-11-15 05:12:21

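The source of Data/ByteString.hs says that the function findSubstrings has been deprecated in favour of breakSubstring. However, I think the old findSubstrings, which was implemented with the KMP algorithm, is much more efficient than the naive algorithm used in breakSubstring. Does anybody know why this was done?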

Here's the old implementation:

{-# DEPRECATED findSubstrings "findSubstrings is deprecated in favour of breakSubstring." #-}

{-
{- This function uses the Knuth-Morris-Pratt string matching algorithm.  -}

findSubstrings pat@(PS _ _ m) str@(PS _ _ n) = search 0 0
 where
  patc x = pat `unsafeIndex` x
  strc x = str `unsafeIndex` x

  -- maybe we should make kmpNext a UArray before using it in search?
  kmpNext = listArray (0,m) (-1:kmpNextL pat (-1))
  kmpNextL p _ | null p = []
  kmpNextL p j = let j' = next (unsafeHead p) j + 1
                     ps = unsafeTail p
                     x = if not (null ps) && unsafeHead ps == patc j'
                            then kmpNext Array.! j' else j'
                    in x:kmpNextL ps j'
  search i j = match ++ rest -- i: position in string, j: position in pattern
    where match = if j == m then [(i - j)] else []
          rest = if i == n then [] else search (i+1) (next (strc i) j + 1)
  next c j | j >= 0 && (j == m || c /= patc j) = next c (kmpNext Array.! j)
           | otherwise = j
-}

And here's the new naive one:

findSubstrings :: ByteString -- ^ String to search for.
               -> ByteString -- ^ String to search in.
               -> [Int]
findSubstrings pat str
    | null pat         = [0 .. length str]
    | otherwise        = search 0 str
  where
    STRICT2(search)
    search n s
        | null s             = []
        | pat `isPrefixOf` s = n : search (n+1) (unsafeTail s)
        | otherwise          =     search (n+1) (unsafeTail s)
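
For comparison, here is a minimal sketch of how the same list of offsets could be obtained from breakSubstring, which is what the deprecation message points to. The name findAll and the example binding are hypothetical and not part of the bytestring API; only breakSubstring, null, length, drop and pack are library functions. The sketch makes no assumption about which search strategy breakSubstring uses internally.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- Hypothetical helper (not part of bytestring): offsets of all, possibly
-- overlapping, occurrences of a pattern, built on top of breakSubstring.
findAll :: B.ByteString -> B.ByteString -> [Int]
findAll pat str
    | B.null pat = [0 .. B.length str]   -- same convention as findSubstrings
    | otherwise  = go 0 str
  where
    -- off: offset of s within the original string
    go off s =
        let (before, rest) = B.breakSubstring pat s
        in if B.null rest
             then []                      -- pattern not found in s
             else let pos = off + B.length before
                  in pos : go (pos + 1) (B.drop 1 rest)

-- Illustrative input only: "ab" occurs at offsets 0, 3 and 5.
example :: [Int]
example = findAll (C.pack "ab") (C.pack "abcabab")
```

Restarting one byte after each match keeps overlapping occurrences, which matches what the naive findSubstrings above reports.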

