Haskell ByteStrings - eventually loading the entire file into memory

Posted 2024-09-27 15:36:22


Greetings,

I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant space (about 1.5MB).

EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'.

Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString read from the file can't be garbage-collected because it's still referenced? But if so, I thought BS.copy would work around that. Any ideas on how to force the computation? I can't seem to get 'seq' into the right place to have an effect.
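For example, a placement like this does nothing, because r `seq` r is denotationally just r - the record is still only forced when somebody demands it:

importRecord :: BS.ByteString -> Record
importRecord txt = r `seq` r   -- no effect: equivalent to plain r
  where
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)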

(NB the source file is tab-separated lines)

Thanks in advance,

Kevin

module Main where

import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad


type Record = BS.ByteString

importRecords :: String -> IO [Record]
importRecords filename = do
    liftM (map importRecord.BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where 
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
    putStrLn $ "Done" ++ (show $ last jrs)
    hFlush stdout
    x <- getLine
    return ()

    -- (***) comment out the next line and the program runs in constant space
    loopInput jrs

main = do 
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    loopInput jrs


Answers (2)

滿滿的愛 2024-10-04 15:36:22


Your call to last forces the spine of the list jrs. To figure that out, it must run through the entire file, building up a thunk for each entry in jrs. Because you aren't evaluating any element of jrs except the last one, those thunks hang around holding references to the bytestring, so it all has to stay in memory.
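A minimal sketch of the mechanism (my own illustration, not your exact program):

import qualified Data.ByteString.Lazy.Char8 as BS

-- 'last' evaluates the list spine only; every element before the last
-- stays an unevaluated thunk that still references the slice of the
-- file contents it was built from.
main :: IO ()
main = do
    contents <- BS.readFile "tabLines"   -- stand-in for your input file
    let recs = map (\l -> BS.copy (BS.split '\t' l !! 26)) (BS.lines contents)
    print (last recs)   -- all the earlier thunks are still live here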

The solution is to force the evaluation of those thunks. Because we're talking about space, the first thing I did was to store your info in a smaller format:

import Data.Word (Word8, Word16)

type Year   = Word16
type Month  = Word8
type Day    = Word8

data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
        deriving (Eq, Ord, Show, Read)

This reduces that ugly 10-byte ByteString (plus ~16 bytes of overhead for structure information) to around 8 bytes.

importRecord now has to call toRecord r to get the right type:

-- parse a "YYYY-MM-DD" field into a packed Record
toRecord :: BS.ByteString -> Record
toRecord bs =
    case BS.splitWith (== '-') bs of
        [y, m, d] -> Rec (rup y) (rup m) (rup d)
        _         -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
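For example, the conversion should behave like this (my illustration, using the date format from your EDIT):

toRecord (BS.pack "2009-10-01")   -- ==> Rec 2009 10 1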

We'll need to evaluate the data when we convert from ByteString to Record, so let's use the parallel package and define an NFData instance (the class comes from Control.DeepSeq):

instance NFData Record where
    rnf (Rec y m d) = y `seq` m `seq` d `seq` ()

Now we're ready to go. I modified main to use evalList, forcing the whole list before the function that wants its last element:

main = do
    jrs <- importRecords "./tabLines"
    let jrs' = using jrs (evalList rdeepseq)
    loopInput jrs'
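For completeness, here is how I'd assemble the fragments above into one compilable program (the imports are what the fragments assume; note BS.copy is no longer needed once the field is parsed into a Record):

module Main where

import qualified Data.ByteString.Lazy.Char8 as BS
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (evalList, rdeepseq, using)
import Data.Word (Word16, Word8)
import System.IO (hFlush, stdout)

type Year   = Word16
type Month  = Word8
type Day    = Word8

data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
        deriving (Eq, Ord, Show, Read)

instance NFData Record where
    rnf (Rec y m d) = y `seq` m `seq` d `seq` ()

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack

toRecord :: BS.ByteString -> Record
toRecord bs =
    case BS.splitWith (== '-') bs of
        [y, m, d] -> Rec (rup y) (rup m) (rup d)
        _         -> Rec 0 0 0

-- parsing into strict Words drops every reference to the file contents,
-- so BS.copy is no longer necessary
importRecord :: BS.ByteString -> Record
importRecord txt = toRecord (BS.split '\t' txt !! 26)

importRecords :: String -> IO [Record]
importRecords filename =
    fmap (map importRecord . BS.lines) (BS.readFile filename)

loopInput :: [Record] -> IO ()
loopInput jrs = do
    putStrLn ("Done" ++ show (last jrs))
    hFlush stdout
    _ <- getLine
    loopInput jrs

main :: IO ()
main = do
    jrs <- importRecords "./tabLines"
    let jrs' = using jrs (evalList rdeepseq)
    loopInput jrs'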

And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).

[heap profile image]

Sorry about the earlier misleading, wrong answer - I was hooked on the fact that incremental processing fixes it, and didn't realize the thunks really were hanging around; not sure why my brain glided over that. I do stand by the gist, though: you should process this information incrementally, which would make all of this answer moot.

FYI, the huge bytestring didn't show up in the heap profiles I posted earlier because foreign allocations (which include ByteString buffers) aren't tracked by the heap profiler.

左耳近心 2024-10-04 15:36:22


There seem to be two questions here:

  • why does the memory usage depend on the presence or absence of the line (***);
  • why is the memory usage with (***) present about 800MB, rather than, say, 40MB.

I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
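(For illustration - my sketch, not part of the original answer: if only the last record matters, recursing on that single value instead of the whole list lets jrs be collected after the first iteration.)

loopInput :: [Record] -> IO ()
loopInput jrs = go (last jrs)
  where
    -- only one Record stays live across iterations, not the whole list
    go r = do
        putStrLn ("Done" ++ show r)
        hFlush stdout
        _ <- getLine
        go r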

As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like

-- a stricter map: forces each f x to WHNF as its cons cell is demanded,
-- so no thunk referencing the original ByteString survives a traversal
map' f []     = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)

Replace the map in importRecords with map' and see whether that reduces your memory usage.
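Concretely, that change would look like this (my sketch, keeping the names from the question):

importRecords :: String -> IO [Record]
importRecords filename =
    liftM (map' importRecord . BS.lines) (BS.readFile filename)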
