Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List package to avoid building the entire list in memory before the printing can start. I implemented this as follows:
module Main where

import Prelude hiding (filter)
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = execute . mapL putStrLn . listFiles =<< head <$> getArgs

listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "." = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      = cons path
      $ join
      $ listFiles
      <$> (path </>)
      <$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath] version.
What am I doing wrong? I've never used the List package's ListT outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.
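The question doesn't show the strict comparison version, but a minimal sketch of what such a FilePath -> IO [FilePath] traversal might look like is below (listFilesStrict and the "demo" tree are my own illustrative names, not from the question):

```haskell
import System.Directory (createDirectoryIfMissing, doesDirectoryExist,
                         getDirectoryContents)
import System.FilePath ((</>))

-- Strict analogue of listFiles: builds the whole result list in memory
-- before anything can be printed.
listFilesStrict :: FilePath -> IO [FilePath]
listFilesStrict path = do
  isDir <- doesDirectoryExist path
  if not isDir
    then return [path]
    else do
      entries <- filter valid <$> getDirectoryContents path
      subtrees <- mapM (listFilesStrict . (path </>)) entries
      return (path : concat subtrees)
  where
    valid "."  = False
    valid ".." = False
    valid _    = True

main :: IO ()
main = do
  -- Build a tiny directory tree so the example is self-contained.
  createDirectoryIfMissing True ("demo" </> "sub")
  writeFile ("demo" </> "a.txt") ""
  writeFile ("demo" </> "sub" </> "b.txt") ""
  files <- listFilesStrict "demo"
  -- demo, demo/a.txt, demo/sub, demo/sub/b.txt
  print (length files)
```

Because `concat` here builds right-nested appends, this version does linear work per file, which is the baseline the ListT version is being compared against.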
Profiling shows that join (together with doesDirectoryExist) accounts for most of the time in your code. Let's see how its definition unfolds: if the root directory of the search has k subdirectories and their contents have already been computed into the lists d1, d2, ..., dk, then after applying join you'll get (roughly):

(...(([] ++ d1) ++ d2) ... ++ dk)

Since x ++ y takes time O(length x), the whole thing takes time O(d1 + (d1 + d2) + ... + (d1 + ... + d(k-1))). If we assume the number of files is n and that they are evenly distributed among d1 ... dk, then the time to compute join is O(n*k), and that is only for the first level of listFiles. This, I think, is the main performance problem with your solution.
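The left-nested append behaviour can be seen in isolation (joinLeft and joinRight are illustrative names of mine, not functions from the List package):

```haskell
import Data.List (foldl')

-- Left-nested appends, the shape the ListT join effectively builds:
-- (...(([] ++ d1) ++ d2) ... ++ dk). Each (++) re-walks the accumulated
-- prefix, so consuming the result costs O(n*k) for k sublists totalling
-- n elements.
joinLeft :: [[a]] -> [a]
joinLeft = foldl' (++) []

-- Right-nested appends, the shape Prelude's concat builds: every element
-- is consed exactly once, so consuming the result costs O(n).
joinRight :: [[a]] -> [a]
joinRight = foldr (++) []

main :: IO ()
main = do
  let ds = replicate 4 "ab"
  -- Same result either way; only the cost differs.
  putStrLn (joinLeft ds)
  putStrLn (joinRight ds)
```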
I'm curious, how well does the same program written to use logict work for you? LogicT is semantically the same as ListT, but implemented in continuation-passing style, so it shouldn't have the kind of concat-related problems you seem to be running into.
Running it on a large directory reveals a memory leak. I suspect this has to do with the strictness of getDirectoryContents, but there might be more going on. Simple profiling didn't turn up much; I'd add some extra cost centers and go from there...