Fixing a particularly obscure Haskell space leak

Posted 2024-12-11 01:51:37

So, after squeezing the last bit of performance out of some Haskell I'm using to break tweet data into n-grams, I've run up against a space leak problem. When I profile, GC takes about 60-70% of the process time and a significant portion of memory is classed as drag. Hopefully some Haskell guru can suggest where I'm going wrong.

{-# LANGUAGE OverloadedStrings, BangPatterns #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H
import Data.List (foldl', sortBy)
import qualified Data.Char as C

-- Characters kept when normalising a tweet.
isClassChar :: Char -> Bool
isClassChar a = C.isAlphaNum a || a == ' ' || a == '\'' ||
                a == '-' || a == '#' || a == '@' || a == '%'

-- Lower-case the text and drop everything outside the character class.
cullWord :: B.ByteString -> B.ByteString
cullWord w = B.map C.toLower $ B.filter isClassChar w

-- Count every n-gram in the text, threading one hash map over all lines.
procTextN :: Int -> B.ByteString -> [([B.ByteString],Int)]
procTextN n t = H.toList $ foldl' ngram H.empty ls
  where !ls           = B.lines $ cullWord t
        ngram tr line = snd $ foldl' breakdown (base,tr) (B.split ' ' line)
        base          = replicate (n-1) ""  -- initial window of n-1 empty words

-- Slide the window forward one word and bump the count for the full n-gram.
breakdown :: ([B.ByteString], H.HashMap [B.ByteString] Int) -> B.ByteString -> ([B.ByteString], H.HashMap [B.ByteString] Int)
breakdown (st@(_:ss), tree) word =
  newStack `seq` expandedWord `seq` (newStack, expandedWord)
      where newStack     = ss ++ [word]
            expandedWord = updateWord (st ++ [word]) tree

-- Increment the count for one n-gram key.
updateWord :: [B.ByteString] -> H.HashMap [B.ByteString] Int -> H.HashMap [B.ByteString] Int
updateWord w h = H.insertWith (+) w 1 h

main :: IO ()
main = do
  test2 <- B.readFile "canewobble"
  print $ filter (\(_,b) -> b > 100) $
     sortBy (\(_,b) (_,d) -> compare d b) $ procTextN 3 test2
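For context on why breakdown forces both components of its accumulator: foldl' only evaluates the accumulator to weak head normal form, so with a plain pair the values inside can still pile up as thunks. A minimal illustration (not from the program above, with hypothetical names sumsLazy/sumsStrict):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- foldl' forces only the outer pair to WHNF here, so (a + x) and
-- (b + x) accumulate as unevaluated thunks on a long input.
sumsLazy :: [Int] -> (Int, Int)
sumsLazy = foldl' (\(a, b) x -> (a + x, b + x)) (0, 0)

-- Bang patterns force both components at every step -- the same
-- effect the seq calls in breakdown are after.
sumsStrict :: [Int] -> (Int, Int)
sumsStrict = foldl' step (0, 0)
  where step (!a, !b) x = (a + x, b + x)
```

Both produce the same result; only the evaluation behaviour differs.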



Comments (1)

东风软 2024-12-18 01:51:37

One small optimisation is to filter the data (using HashMap.filter) before sorting it. This helped me shave 2s off the final execution time. Another thing I did was to use sequences (Data.Sequence) instead of lists (no noticeable difference :-( ). My version can be found here.
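A sketch of that filter-before-sort reordering, shown here with Data.Map.Strict from containers (Data.HashMap.Strict.filter has the same shape); the name topNGrams and the 100 threshold are illustrative, matching the cutoff used in the question's main:

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as M

-- Filter the counts down first, then sort: sortBy only ever sees
-- the entries that survive the threshold.
topNGrams :: M.Map String Int -> [(String, Int)]
topNGrams = sortBy (comparing (Down . snd)) . M.toList . M.filter (> 100)
```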

Looking at the heap profile, I don't think there is a space leak in your program:
[heap profile image: "Space profile"]

You're just building a fairly large hash table (377141 key-value pairs) in memory, and then discarding it after some processing. According to Johan's post, a hash table of this size takes approx. 5*N + 4*(N-1) words = 3394265*4 bytes ~= 13 MiB, which agrees with what the heap profile shows. The remaining space is taken up by the keys and values. On my machine, time spent in GC is around 40%, which doesn't sound unreasonable given that you're constantly updating the hash table and the temporary "stacks" while not doing anything computation-heavy with the data. Since the only operation you need the hash table for is insertWith, perhaps it's better to use a mutable data structure?
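The size arithmetic above can be checked directly; the 4 bytes per word matches the answer's 32-bit assumption (the helper name overheadBytes is illustrative):

```haskell
-- Overhead of a Data.HashMap hash table per the 5*N + 4*(N-1) words
-- estimate quoted above, at 4 bytes per word (32-bit machine).
overheadBytes :: Int -> Int
overheadBytes n = (5 * n + 4 * (n - 1)) * 4
```

With n = 377141 this gives 13577060 bytes, about 12.95 MiB, in line with the ~13 MiB figure from the heap profile.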

Update: I've rewritten your program using a mutable hash table. Interestingly, the speed difference is not large, but memory usage is slightly better:

[heap profile image]

As you can see, the size of the block allocated for the hash table stays constant throughout the execution.

