为什么我的代码使用 List 包中的一元列表如此慢?

发布于 2024-09-27 07:10:44 字数 1410 浏览 4 评论 0原文

上周,用户 Masse 提出了一个关于在 Haskell 目录中递归列出文件的问题。我的第一个想法是尝试使用 List 中的单子列表以避免在打印开始之前在内存中构建整个列表。我的实现如下:

module Main where

import Prelude hiding (filter) 
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = execute . mapL putStrLn . listFiles =<< head <$> getArgs

listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <$> (path </>)
     <$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))

它工作得很好,因为它立即开始打印并且使用很少的内存。不幸的是,它也比类似的 FilePath -> 慢几十倍。 IO [FilePath] 版本。

我做错了什么?我从未在像这样的玩具示例之外使用过 List 包的 ListT,所以我不知道期望什么样的性能,但是 30 秒(相对于几分之一秒)来处理大约 40,000 个文件的目录似乎太慢了。

Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List package to avoid building the entire list in memory before the printing can start. I implemented this as follows:

module Main where

import Prelude hiding (filter) 
import Control.Applicative ((<
gt;))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = execute . mapL putStrLn . listFiles =<< head <
gt; getArgs

listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <
gt; (path </>)
     <
gt; (filter valid =<< fromList <
gt; liftIO (getDirectoryContents path))

This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath] version.

What am I doing wrong? I've never used the List package's ListT outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

秋叶绚丽 2024-10-04 07:10:44

分析显示 join(与 doesDirectoryExists 一起)占用了代码中的大部分时间。让我们看看它的定义是如何展开的:

  join x
=> (definition of join in Control.Monad)
  x >>= id
=> (definition of >>= in Control.Monad.ListT)
  foldrL' mappend mempty (fmap id x)
=> (fmap id = id)
  foldrL' mappend mempty x

如果在搜索的根目录中有 k 个子目录,并且它们的内容已经在列表中计算:d1, d 2, ... dk,然后在应用 join 后,您将得到(大致): (. ..(([] ++ d1) ++ d2) ... ++ dk)。由于 x ++ y 需要时间 O(length x) 整个过程将需要时间 O(d1 + (d< sub>1 + d2) + ... + (d1 + ... dk-1))。如果我们假设文件数量为 n 并且它们均匀分布在 d1 ... dk 那么计算join 的时间将为O(n*k),并且这只适用于listFiles 的第一层。

我认为这是您的解决方案的主要性能问题。

Profiling shows that join (together with doesDirectoryExists) accounts for most of the time in your code. Lets see how its definition unfolds:

  join x
=> (definition of join in Control.Monad)
  x >>= id
=> (definition of >>= in Control.Monad.ListT)
  foldrL' mappend mempty (fmap id x)
=> (fmap id = id)
  foldrL' mappend mempty x

If in the root directory of the search there are k subdirectories and their contents are already computed in the lists: d1, d2, ... dk, then after applying join you'll get (roughly): (...(([] ++ d1) ++ d2) ... ++ dk). Since x ++ y takes time O(length x) the whole thing will take time O(d1 + (d1 + d2) + ... + (d1 + ... dk-1)). If we assume that the number of files is n and they are evenly distributed between d1 ... dk then the time to compute join would be O(n*k) and that is only for the first level of listFiles.

This, I think, is the main performance problem with your solution.

泪眸﹌ 2024-10-04 07:10:44

我很好奇,使用 logict 为你工作? LogicT 在语义上与 ListT 相同,但以连续传递风格实现,因此它不应该出现与 concat 相关的问题类型你似乎遇到了。

import Prelude hiding (filter)
import Control.Applicative
import Control.Monad
import Control.Monad.Logic
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = sequence_ =<< observeAllT . fmap putStrLn . listFiles =<< head <
gt; getArgs

cons :: MonadPlus m => a -> m a -> m a
cons x xs = return x `mplus` xs

fromList :: MonadPlus m => [a] -> m a
fromList = foldr cons mzero

filter :: MonadPlus m => (a -> Bool) -> m a -> m a
filter f xs = do
  x <- xs
  guard $ f x
  return x

listFiles :: FilePath -> LogicT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <
gt; (path </>)
     <
gt; (filter valid =<< fromList <
gt; liftIO (getDirectoryContents path))

I'm curious, how well does the same program written to use logict work for you? LogicT is semantically the same as ListT, but implemented in continuation-passing style so that it shouldn't have the concat-related type of problems you seem to be running into.

import Prelude hiding (filter)
import Control.Applicative
import Control.Monad
import Control.Monad.Logic
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = sequence_ =<< observeAllT . fmap putStrLn . listFiles =<< head <
gt; getArgs

cons :: MonadPlus m => a -> m a -> m a
cons x xs = return x `mplus` xs

fromList :: MonadPlus m => [a] -> m a
fromList = foldr cons mzero

filter :: MonadPlus m => (a -> Bool) -> m a -> m a
filter f xs = do
  x <- xs
  guard $ f x
  return x

listFiles :: FilePath -> LogicT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <
gt; (path </>)
     <
gt; (filter valid =<< fromList <
gt; liftIO (getDirectoryContents path))
终难遇 2024-10-04 07:10:44

在大目录上运行它会发现内存泄漏。我怀疑这与 getDirectoryContents 的严格性有关,但可能还有更多的事情发生。简单的分析并没有出现太多,我会添加一些额外的成本中心,然后从那里开始......

Running it on a large directory reveals a memory leak. I suspect this has to do with the strictness of getDirectoryContents, but there might be more going on. Simple profiling didn't turn up much, I'd add some extra cost centers and go from there...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文