Haskell 惰性字节字符串 +读/写进度功能

发布于 2024-11-19 14:14:45 字数 1924 浏览 10 评论 0原文

我正在学习 Haskell Lazy IO。

我正在寻找一种优雅的方式来复制大文件（8Gb），同时将复制进度打印到控制台。

考虑以下以静默方式复制文件的简单程序。

module Main where

import System
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          body <- B.readFile from
          B.writeFile to body

想象一下您想要用于报告的回调函数：

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)

问题：如何将 onReadBytes 函数编织到 Lazy ByteString 中，以便在成功读取时回调它？或者如果这个设计不好，那么 Haskell 的方法是什么呢？

注意：回调的频率并不重要，可以每 1024 字节或每 1 Mb 调用一次 - 不重要

答案：非常感谢 camccann 的回答。我建议完整地阅读它。

以下是我基于 camccann 代码编写的代码版本，您可能会发现它很有用。

module Main where

import System
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          withFile from ReadMode $ \fromH ->
            withFile to WriteMode $ \toH ->
              copyH fromH toH $ \x -> putStrLn $ "Bytes copied: " ++ show x

copyH :: Handle -> Handle -> (Integer -> IO()) -> IO ()
copyH fromH toH onProgress =
    copy (B.hGet fromH (256 * 1024)) (write toH) B.null onProgress
    where write o x  = do B.hPut o x
                          return . fromIntegral $ B.length x

copy :: (Monad m) => m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m()
copy = copy_ 0

copy_ :: (Monad m) => Integer -> m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m()
copy_ count inp outp done onProgress = do x <- inp
                                          unless (done x) $
                                            do n <- outp x
                                               onProgress (n + count)
                                               copy_ (n + count) inp outp done onProgress

原文

I am learing Haskell Lazy IO.

I am looking for an elegant way to copy a large file (8Gb) while printing copy progress to console.

Consider the following simple program that copies a file silently.

module Main where

import System
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          body <- B.readFile from
          B.writeFile to body

Imgine there is a callback function you want to use for reporting:

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)

QUESTION: how to weave onReadBytes function into Lazy ByteString so it will be called back on successfull read? Or if this design is not good, then what is the Haskell way to do it?

NOTE: the frequency of callback is not important, it can be called every 1024 bytes or every 1 Mb -- not important

ANSWER: Many thanks to camccann for the answer. I suggest to read it entirely.

Bellow is my version of the code based on camccann's code, you may find it useful.

module Main where

import System
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          withFile from ReadMode $ \fromH ->
            withFile to WriteMode $ \toH ->
              copyH fromH toH $ \x -> putStrLn $ "Bytes copied: " ++ show x

copyH :: Handle -> Handle -> (Integer -> IO()) -> IO ()
copyH fromH toH onProgress =
    copy (B.hGet fromH (256 * 1024)) (write toH) B.null onProgress
    where write o x  = do B.hPut o x
                          return . fromIntegral $ B.length x

copy :: (Monad m) => m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m()
copy = copy_ 0

copy_ :: (Monad m) => Integer -> m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m()
copy_ count inp outp done onProgress = do x <- inp
                                          unless (done x) $
                                            do n <- outp x
                                               onProgress (n + count)
                                               copy_ (n + count) inp outp done onProgress

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

别挽留 2024-11-26 14:14:45

首先，我想指出，相当多的 Haskell 程序员总体上对惰性 IO 持怀疑态度。从技术上讲，它违反了纯度，但在有限的方面（据我所知）在一致输入上运行单个程序时并不明显^[0]。另一方面，很多人对此表示满意，同样是因为它只涉及非常有限的杂质。

为了营造一种实际上是通过按需 I/O 创建的惰性数据结构的错觉，像 readFile 这样的函数是在幕后使用狡猾的诡计实现的。按需 I/O 中的编织是该函数固有的，并且它并不是真正可扩展的，其原因与从中获取常规 ByteString 的幻觉几乎相同。

掌握细节并编写伪代码，像 readFile 这样的东西基本上是这样工作的：

lazyInput inp = lazyIO (lazyInput' inp)
lazyInput' inp = do x <- readFrom inp
                    if (endOfInput inp)
                        then return []
                        else do xs <- lazyInput inp
                                return (x:xs)

...每次调用 lazyIO 时，它都会推迟 I/O 直到实际使用该值。要在每次实际读取发生时调用报告函数，您需要直接将其编织进去，虽然可以编写此类函数的通用版本，但据我所知，不存在。

鉴于上述情况，您有几个选择：

查找您正在使用的惰性 I/O 函数的实现，并实现您自己的函数，其中包括进度报告函数。如果这感觉像是一个肮脏的黑客，那是因为它几乎就是这样，但是就这样。
放弃惰性 I/O 并切换到更明确和可组合的东西。这就是 Haskell 社区作为一个整体正在走向的方向，特别是针对 Iteratees，它为您提供了可很好组合的小型流处理器构建块，这些块具有更可预测的行为。缺点是这个概念仍处于积极开发阶段，因此在实现方面没有达成共识，也没有学习使用它们的单一起点。
放弃惰性 I/O 并切换到普通的常规 I/O：编写一个 IO 操作来读取块、打印报告信息并处理尽可能多的输入；然后循环调用它直到完成。根据您对输入所做的操作以及您在处理中对惰性的依赖程度，这可能涉及从编写几个几乎微不足道的函数到构建一堆有限状态机流处理器并获得 90 % 的方法来重新发明 Iteratees。

[0]：这里的底层函数称为unsafeInterleaveIO，据我所知，从中观察杂质的唯一方法需要在不同的输入上运行程序（在这种情况下，无论如何它都有权表现不同，它可能只是以在纯代码中没有意义的方式这样做），或者以某些方式更改代码（即，应该无效的重构可能会产生非本地影响）效果）。

下面是一个使用更多可组合函数以“普通旧常规 I/O”方式执行操作的粗略示例：

import System
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          -- withFile closes the handle for us after the action completes
          withFile from ReadMode $ \inH ->
            withFile to WriteMode $ \outH ->
                -- run the loop with the appropriate actions
                runloop (B.hGet inH 128) (processBytes outH) B.null

-- note the very generic type; this is useful, because it proves that the
-- runloop function can only execute what it's given, not do anything else
-- behind our backs.
runloop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
runloop inp outp done = do x <- inp
                           if done x
                             then return ()
                             else do outp x
                                     runloop inp outp done

-- write the output and report progress to stdout. note that this can be easily
-- modified, or composed with other output functions.
processBytes :: Handle -> B.ByteString -> IO ()
processBytes h bs | B.null bs = return ()
                  | otherwise = do onReadBytes (fromIntegral $ B.length bs)
                                   B.hPut h bs

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)

上面的“128”表示一次要读取的字节数。在我的“Stack Overflow snippets”目录中的随机源文件上运行此命令：

$ runhaskell ReadBStr.hs Corec.hs temp
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 83
$

First, I'd like to note that a fair number of Haskell programmers regard lazy IO in general with some suspicion. It technically violates purity, but in a limited way that (as far as I'm aware) isn't noticeable when running a single program on consistent input^[0]. On the other hand, plenty of people are fine with it, again because it involves only a very restricted kind of impurity.

To create the illusion of a lazy data structure that's actually created with on-demand I/O, functions like readFile are implemented using sneaky shenanigans behind the scenes. Weaving in the on-demand I/O is inherent to the function, and it's not really extensible for pretty much the same reasons that the illusion of getting a regular ByteString from it is convincing.

Handwaving the details and writing pseudocode, something like readFile basically works like this:

lazyInput inp = lazyIO (lazyInput' inp)
lazyInput' inp = do x <- readFrom inp
                    if (endOfInput inp)
                        then return []
                        else do xs <- lazyInput inp
                                return (x:xs)

...where each time lazyIO is called, it defers the I/O until the value is actually used. To invoke your reporting function each time the actual read occurs, you'd need to weave it in directly, and while a generalized version of such a function could be written, to my knowledge none exist.

Given the above, you have a few options:

Look up the implementation of the lazy I/O functions you're using, and implement your own that include the progress reporting function. If this feels like a dirty hack, that's because it pretty much is, but there you go.
Abandon lazy I/O and switch to something more explicit and composable. This is the direction that the Haskell community as a whole seems to be heading in, specifically toward some variation on Iteratees, which give you nicely composable little stream processor building blocks that have more predictable behavior. The downside is that the concept is still under active development so there's no consensus on implementation or single starting point for learning to use them.
Abandon lazy I/O and switch to plain old regular I/O: Write an IO action that reads a chunk, prints the reporting info, and processes as much input as it can; then invoke it in a loop until done. Depending on what you're doing with the input and how much you're relying on laziness in your processing, this could involve anything from writing a couple nearly-trivial functions to building a bunch of finite-state-machine stream processors and getting 90% of the way to reinventing Iteratees.

[0]: The underlying function here is called unsafeInterleaveIO, and to the best of my knowledge the only ways to observe impurity from it require either running the program on different input (in which case it's entitled to behave differently anyhow, it just may be doing so in ways that don't make sense in pure code), or changing the code in certain ways (i.e., refactorings that should have no effect can have non-local effects).

Here's a rough example of doing things the "plain old regular I/O" way, using more composable functions:

import System
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          -- withFile closes the handle for us after the action completes
          withFile from ReadMode $ \inH ->
            withFile to WriteMode $ \outH ->
                -- run the loop with the appropriate actions
                runloop (B.hGet inH 128) (processBytes outH) B.null

-- note the very generic type; this is useful, because it proves that the
-- runloop function can only execute what it's given, not do anything else
-- behind our backs.
runloop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
runloop inp outp done = do x <- inp
                           if done x
                             then return ()
                             else do outp x
                                     runloop inp outp done

-- write the output and report progress to stdout. note that this can be easily
-- modified, or composed with other output functions.
processBytes :: Handle -> B.ByteString -> IO ()
processBytes h bs | B.null bs = return ()
                  | otherwise = do onReadBytes (fromIntegral $ B.length bs)
                                   B.hPut h bs

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)

The "128" up there is how many bytes to read at a time. Running this on a random source file in my "Stack Overflow snippets" directory:

$ runhaskell ReadBStr.hs Corec.hs temp
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 83
$

回复收藏 0 原文