Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List package to avoid building the entire list in memory before the printing can start. I implemented this as follows:
module Main where

import Prelude hiding (filter)
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = execute . mapL putStrLn . listFiles =<< head <$> getArgs

listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "." = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      = cons path
      $ join
      $ listFiles
      <$> (path </>)
      <$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath] version.
What am I doing wrong? I've never used the List package's ListT outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.
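The question doesn't show the strict comparison version, but a minimal sketch of what such a FilePath -> IO [FilePath] traversal might look like is below (listFilesStrict and the "demo" tree are my own illustrative names, not from the question):

```haskell
import System.Directory (createDirectoryIfMissing, doesDirectoryExist,
                         getDirectoryContents)
import System.FilePath ((</>))

-- Strict analogue of listFiles: builds the whole result list in memory
-- before anything can be printed.
listFilesStrict :: FilePath -> IO [FilePath]
listFilesStrict path = do
  isDir <- doesDirectoryExist path
  if not isDir
    then return [path]
    else do
      entries <- filter valid <$> getDirectoryContents path
      subtrees <- mapM (listFilesStrict . (path </>)) entries
      return (path : concat subtrees)
  where
    valid "."  = False
    valid ".." = False
    valid _    = True

main :: IO ()
main = do
  -- Build a tiny directory tree so the example is self-contained.
  createDirectoryIfMissing True ("demo" </> "sub")
  writeFile ("demo" </> "a.txt") ""
  writeFile ("demo" </> "sub" </> "b.txt") ""
  files <- listFilesStrict "demo"
  -- demo, demo/a.txt, demo/sub, demo/sub/b.txt
  print (length files)
```

Because `concat` here builds right-nested appends, this version does linear work per file, which is the baseline the ListT version is being compared against.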
Profiling shows that join (together with doesDirectoryExist) accounts for most of the time in your code. Let's see how its definition unfolds: if the root directory of the search has k subdirectories and their contents have already been computed into the lists d1, d2, ..., dk, then after applying join you'll get (roughly):

(...(([] ++ d1) ++ d2) ... ++ dk)

Since x ++ y takes time O(length x), the whole thing takes time O(d1 + (d1 + d2) + ... + (d1 + ... + d(k-1))). If we assume the number of files is n and that they are evenly distributed among d1 ... dk, then the time to compute join is O(n*k), and that is only for the first level of listFiles. This, I think, is the main performance problem with your solution.
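The left-nested append behaviour can be seen in isolation (joinLeft and joinRight are illustrative names of mine, not functions from the List package):

```haskell
import Data.List (foldl')

-- Left-nested appends, the shape the ListT join effectively builds:
-- (...(([] ++ d1) ++ d2) ... ++ dk). Each (++) re-walks the accumulated
-- prefix, so consuming the result costs O(n*k) for k sublists totalling
-- n elements.
joinLeft :: [[a]] -> [a]
joinLeft = foldl' (++) []

-- Right-nested appends, the shape Prelude's concat builds: every element
-- is consed exactly once, so consuming the result costs O(n).
joinRight :: [[a]] -> [a]
joinRight = foldr (++) []

main :: IO ()
main = do
  let ds = replicate 4 "ab"
  -- Same result either way; only the cost differs.
  putStrLn (joinLeft ds)
  putStrLn (joinRight ds)
```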
I'm curious, how well does the same program written to use logict work for you? LogicT is semantically the same as ListT, but implemented in continuation-passing style, so it shouldn't have the kind of concat-related problems you seem to be running into.
Running it on a large directory reveals a memory leak. I suspect this has to do with the strictness of getDirectoryContents, but there might be more going on. Simple profiling didn't turn up much; I'd add some extra cost centers and go from there...