haskell 中大文件的 IO:性能问题

发布于 2024-10-25 09:49:14 字数 1559 浏览 8 评论 0原文

我正在尝试使用 Haskell 处理大文件。我想逐个字节地浏览输入文件,并生成一个又一个字节的输出。当然,我需要使用合理大小(几KB)的块来缓冲 IO。我做不到,我需要你的帮助。

import System 
import qualified Data.ByteString.Lazy as BL 
import Data.Word  
import Data.List

main :: IO () 
main =     
    do         
        args <- System.getArgs         
        let filename = head args         
        byteString <- BL.readFile filename         
        let wordsList = BL.unpack byteString         
        let foldFun acc word = doSomeStuff word : acc
        let wordsListCopy = foldl' foldFun [] wordsList
        let byteStringCopy = BL.pack (reverse wordsListCopy)
        BL.writeFile (filename ++ ".cpy") byteStringCopy
    where
        doSomeStuff = id

我将此文件命名为 TestCopy.hs,然后执行以下操作:

$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root  5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs 
[1 of 1] Compiling Main             ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB

real    0m5.631s
user    0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB 

real    3m6.671s
user    0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy 
$ time ./TestCopy 10MB +RTS -K500M -RTS

real    2m50.261s
user    0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy 
$ 

我的问题:5MB 和 10 MB 文件之间存在巨大差异。我希望性能与输入文件的大小成线性关系。请问我做错了什么,我该如何实现这一目标?我不介意使用惰性字节串或其他任何东西,只要它有效,但它必须是标准的 ghc 库。

Precision:这是一个大学项目。我并不是想复制文件。 doSomeStuff 函数应执行我必须自定义的压缩/解压缩操作。

I'm trying to work over big files using Haskell. I'd like to browse an input file byte after byte, and to generate an output byte after byte. Of course I need the IO to be buffered with blocks of reasonable size (a few KB). I can't do it, and I need your help please.

import System 
import qualified Data.ByteString.Lazy as BL 
import Data.Word  
import Data.List

main :: IO () 
main =     
    do         
        args <- System.getArgs         
        let filename = head args         
        byteString <- BL.readFile filename         
        let wordsList = BL.unpack byteString         
        let foldFun acc word = doSomeStuff word : acc
        let wordsListCopy = foldl' foldFun [] wordsList
        let byteStringCopy = BL.pack (reverse wordsListCopy)
        BL.writeFile (filename ++ ".cpy") byteStringCopy
    where
        doSomeStuff = id

I name this file TestCopy.hs, then do the following:

$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root  5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs 
[1 of 1] Compiling Main             ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB

real    0m5.631s
user    0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB 

real    3m6.671s
user    0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy 
$ time ./TestCopy 10MB +RTS -K500M -RTS

real    2m50.261s
user    0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy 
$ 

My problem: There is a huge difference between a 5MB and a 10 MB file. I'd like the performances to be linear in the size of the input file. Please what am i doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else as long as it works, but it has to be a standard ghc library.

Precision: It's for a university project. And I'm not trying to copy files. The doSomeStuff function shall perform compression/decompression actions that I have to customize.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

凉月流沐 2024-11-01 09:49:14

对于分块输入处理,我将使用 enumerator 包。

import Data.Enumerator
import Data.Enumerator.Binary (enumFile)

我们使用字节串

import Data.ByteString as BS

和 IO

import Control.Monad.Trans (liftIO)
import Control.Monad (mapM_)
import System (getArgs)

您的主函数可能如下所示:

main =
  do (filepath:_) <- getArgs
     let destination
     run_ $ enumFile filepath $ writeFile (filepath ++ ".cpy")

enumFile 每个块读取 4096 字节并将它们传递给 writeFile,后者将其写下来。

enumWrite 定义为:

enumWrite :: FilePath -> Iteratee BS.ByteString IO ()
enumWrite filepath =
  do liftIO (BS.writeFile filepath BS.empty)   -- ensure the destination is empty
     continue step
  where
  step (Chunks xs) =
    do liftIO (mapM_ (BS.appendFile filepath) xs)
       continue step
  step EOF         = yield () EOF

如您所见,step 函数获取字节串块并附加它们
到目标文件。这些块的类型为 Stream BS.Bytestring,其中
流定义为:

data Stream a = Chunks [a] | EOF

在 EOF 步骤终止时,产生 ()。

为了更详细地阅读这一点,我个人推荐迈克尔
Snoymans 教程

数字

$ time ./TestCopy 5MB
./TestCopy 5MB  2,91s user 0,32s system 96% cpu 3,356 total

$ time ./TestCopy2 5MB
./TestCopy2 5MB  0,04s user 0,03s system 93% cpu 0,075 total

这是一个很大的进步。现在为了实现您的折叠,您可能需要编写一个 Enumeratee,它用于转换输入流。幸运的是,枚举器包中已经定义了一个映射函数,可以根据您的需要对其进行修改,即可以修改它以继承状态。

关于中间结果的构造

您以相反的顺序构造wordsList,然后将其反转。我认为 差异列表 做得更好,因为附加需要由于附加只是一个函数组合,因此只需 O(1) 时间。我不确定它们是否占用更多空间。这是差异列表的粗略草图:

type DList a = [a] -> [a]

emptyList :: DList a
emptyList = id

snoc :: DList a -> a -> DList a
snoc dlist a = dlist . (a:)

toList :: DList a -> [a]
toList dlist = dlist []

可能不再需要这个答案,但为了完整性我添加了它。

For chunked input processing I would use the enumerator package.

import Data.Enumerator
import Data.Enumerator.Binary (enumFile)

We use bytestrings

import Data.ByteString as BS

and IO

import Control.Monad.Trans (liftIO)
import Control.Monad (mapM_)
import System (getArgs)

Your main function could look like following:

main =
  do (filepath:_) <- getArgs
     let destination
     run_ $ enumFile filepath $ writeFile (filepath ++ ".cpy")

enumFile reads 4096 bytes per chunk and passes these to writeFile, which writes it down.

enumWrite is defined as:

enumWrite :: FilePath -> Iteratee BS.ByteString IO ()
enumWrite filepath =
  do liftIO (BS.writeFile filepath BS.empty)   -- ensure the destination is empty
     continue step
  where
  step (Chunks xs) =
    do liftIO (mapM_ (BS.appendFile filepath) xs)
       continue step
  step EOF         = yield () EOF

As you can see, the step function takes chunks of bytestrings and appends them
to the destination file. These chunks have the type Stream BS.Bytestring, where
Stream is defined as:

data Stream a = Chunks [a] | EOF

On an EOF step terminates, yielding ().

To have a much more elaborate read on this I personally recommend Michael
Snoymans tutorial

The numbers

$ time ./TestCopy 5MB
./TestCopy 5MB  2,91s user 0,32s system 96% cpu 3,356 total

$ time ./TestCopy2 5MB
./TestCopy2 5MB  0,04s user 0,03s system 93% cpu 0,075 total

That's quite an improvement. Now in order to implement your fold you probably want to write an Enumeratee, which is used to transform a input stream. Fortunately there is already a map function defined in the enumerator package, which can be modified for your need, i.e. it can be modified to carry over state.

On the construction of the intermediate result

You construct wordsList in reverse order and reverse it afterwards. I think difference lists do a better job, because appends take only O(1) time due to the fact that appending is only a function composition. I'm not sure whether they takes more space though. Here's a rough sketch of difference lists:

type DList a = [a] -> [a]

emptyList :: DList a
emptyList = id

snoc :: DList a -> a -> DList a
snoc dlist a = dlist . (a:)

toList :: DList a -> [a]
toList dlist = dlist []

This answer is probably not needed anymore, but I added it for completeness.

〃温暖了心ぐ 2024-11-01 09:49:14

我认为这是昨天的 在 haskell 中读取大文件? 的后续内容。

尝试使用“-rtsopts -O2”而不是“-O”进行编译。

您声称“我想逐个字节地浏览输入文件,并生成一个又一个字节的输出”。但您的代码在尝试创建任何输出之前会读取整个输入。这并不能很好地代表目标。

在我的系统中,我看到“ghc -rtsopts --make -O2 b.hs”,

(! 741)-> time ./b five
real 0m2.789s user 0m2.235s sys 0m0.518s
(! 742)-> time ./b ten
real 0m5.830s user 0m4.623s sys 0m1.027s

现在对我来说看起来是线性的。

I take it this is a follow on to Reading large file in haskell? from yesterday.

Try compiling with "-rtsopts -O2" instead of just "-O".

You claim "I'd like to browse an input file byte after byte, and to generate an output byte after byte." but your code reads the entire input before trying to create any output. This is just not very representative of the goal.

With my system I see "ghc -rtsopts --make -O2 b.hs" giving

(! 741)-> time ./b five
real 0m2.789s user 0m2.235s sys 0m0.518s
(! 742)-> time ./b ten
real 0m5.830s user 0m4.623s sys 0m1.027s

Which now looks linear to me.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文