IO of big files in Haskell: performance issue
I'm trying to work over big files using Haskell. I'd like to browse an input file byte after byte, and to generate an output byte after byte. Of course I need the IO to be buffered with blocks of reasonable size (a few KB). I can't do it, and I need your help please.
import System
import qualified Data.ByteString.Lazy as BL
import Data.Word
import Data.List

main :: IO ()
main =
  do
    args <- System.getArgs
    let filename = head args
    byteString <- BL.readFile filename
    let wordsList = BL.unpack byteString
    let foldFun acc word = doSomeStuff word : acc
    let wordsListCopy = foldl' foldFun [] wordsList
    let byteStringCopy = BL.pack (reverse wordsListCopy)
    BL.writeFile (filename ++ ".cpy") byteStringCopy
  where
    doSomeStuff = id
I name this file TestCopy.hs, then do the following:
$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root 5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs
[1 of 1] Compiling Main ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB
real 0m5.631s
user 0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB
real 3m6.671s
user 0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy
$ time ./TestCopy 10MB +RTS -K500M -RTS
real 2m50.261s
user 0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy
$
My problem: there is a huge difference between a 5 MB and a 10 MB file. I'd like the performance to be linear in the size of the input file. What am I doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else as long as it works, but it has to be a standard GHC library.
Clarification: it's for a university project, and I'm not trying to copy files. The doSomeStuff function shall perform compression/decompression actions that I have to customize.
2 Answers
For chunked input processing I would use the enumerator package.
We use bytestrings and IO.
Your main function could look like the following:
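The answer's original code block did not survive; the following is a hedged reconstruction from the surrounding prose. It assumes the third-party enumerator package (Data.Enumerator, Data.Enumerator.Binary), and enumWrite is the custom iteratee the answer goes on to describe:

```haskell
-- Reconstruction sketch, not the answer's verbatim code: a chunked file
-- copy built on the enumerator package.
import qualified Data.ByteString as BS
import Control.Monad.IO.Class (liftIO)
import Data.Enumerator
import qualified Data.Enumerator.Binary as EB
import System.Environment (getArgs)
import System.IO

-- Write every incoming chunk to the destination file; on EOF, close the
-- handle and yield ().
enumWrite :: FilePath -> Iteratee BS.ByteString IO ()
enumWrite path = do
    h <- liftIO (openBinaryFile path WriteMode)
    continue (step h)
  where
    step h (Chunks xs) = do
        liftIO (mapM_ (BS.hPut h) xs)
        continue (step h)
    step h EOF = do
        liftIO (hClose h)
        yield () EOF

main :: IO ()
main = do
    [from, to] <- getArgs
    run_ (EB.enumFile from $$ enumWrite to)
```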
enumFile reads 4096 bytes per chunk and passes them to enumWrite, which writes them out.
enumWrite is defined so that its step function takes chunks of bytestrings and appends them to the destination file. These chunks have the type Stream BS.ByteString, where Stream is defined as:
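The snippet with the definition did not survive the scrape; Stream, as the enumerator package defines it, is essentially:

```haskell
-- The Stream type from the enumerator package: a step receives either a
-- batch of chunks or the end-of-input marker.
data Stream a
    = Chunks [a]
    | EOF
```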
On an EOF, step terminates, yielding ().
For a much more elaborate read on this, I personally recommend Michael Snoyman's tutorial.
The numbers
That's quite an improvement. Now, in order to implement your fold, you probably want to write an Enumeratee, which is used to transform an input stream. Fortunately, there is already a map function defined in the enumerator package which can be modified for your needs, i.e. it can be modified to carry state over.
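As a sketch of how that might fit together (my assumption, not code from the original answer), Data.Enumerator.List.map can be lifted over each ByteString chunk:

```haskell
-- Sketch (assumed composition, not the answer's original code): transform
-- the byte stream chunk by chunk with the enumerator package's map.
import qualified Data.ByteString as BS
import Data.Enumerator (Enumeratee)
import qualified Data.Enumerator.List as EL
import Data.Word (Word8)

-- Placeholder transformation, as in the question.
doSomeStuff :: Word8 -> Word8
doSomeStuff = id

-- Map doSomeStuff over every byte of every chunk flowing through.
transform :: Monad m => Enumeratee BS.ByteString BS.ByteString m b
transform = EL.map (BS.map doSomeStuff)
```

A pipeline could then be composed roughly as run_ (EB.enumFile from $$ (transform =$ enumWrite to)), assuming the =$ operator of later enumerator versions; carrying state over would mean writing a similar Enumeratee by hand.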
On the construction of the intermediate result
You construct wordsList in reverse order and reverse it afterwards. I think difference lists do a better job, because an append is only a function composition and therefore takes O(1) time. I'm not sure whether they take more space, though. Here's a rough sketch of difference lists:
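The answer's original snippet is missing; a minimal illustration of the idea might look like:

```haskell
-- Rough sketch of difference lists: a list is represented as a function
-- that prepends its elements to whatever follows, so appending at the end
-- is a single function composition, O(1).
type DList a = [a] -> [a]

emptyD :: DList a
emptyD = id

-- Append one element at the end in O(1).
snocD :: DList a -> a -> DList a
snocD xs x = xs . (x :)

-- Recover an ordinary list.
toListD :: DList a -> [a]
toListD xs = xs []
```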
This answer is probably not needed anymore, but I added it for completeness.
I take it this is a follow-on to Reading large file in haskell? from yesterday.
Try compiling with "-rtsopts -O2" instead of just "-O".
You claim "I'd like to browse an input file byte after byte, and to generate an output byte after byte." but your code reads the entire input before trying to create any output. This is just not very representative of the goal.
With my system, I see "ghc -rtsopts --make -O2 b.hs" giving timings that now look linear to me.
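For reference, a minimal streaming version of the question's program (my sketch, not the answerer's code): lazy ByteString's BL.map keeps reading, transforming and writing interleaved chunk by chunk, so memory use stays constant regardless of file size.

```haskell
-- Sketch: stream the file with lazy ByteStrings instead of unpacking it
-- to a [Word8]. BL.map interleaves reading, transforming and writing.
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Placeholder transformation, as in the question.
doSomeStuff :: Word8 -> Word8
doSomeStuff = id

-- Read, transform and write without materialising the whole file.
copyTransformed :: FilePath -> IO ()
copyTransformed filename = do
    contents <- BL.readFile filename
    BL.writeFile (filename ++ ".cpy") (BL.map doSomeStuff contents)
```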