Haskell ByteStrings - ends up loading a large file into memory
Greetings,
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'.
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
Kevin
```haskell
module Main where

import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad

type Record = BS.ByteString

importRecords :: String -> IO [Record]
importRecords filename = do
  liftM (map importRecord . BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where
    r = getField 26
    getField f = BS.copy $ (BS.split '\t' txt) !! f

loopInput :: [Record] -> IO ()
loopInput jrs = do
  putStrLn $ "Done" ++ (show $ last jrs)
  hFlush stdout
  x <- getLine
  return ()
  -- (***)
  loopInput jrs

main = do
  jrs <- importRecords "c:\\downloads\\lcg1m.txt"
  loopInput jrs
```
Your call to `last` forces the list `jrs`. To figure that out it must run through the entire file, building up thunks for each entry in `jrs`. Because you aren't evaluating each element in `jrs` (except the last one), these thunks hang around with references to the bytestring, so it must stay in memory.

The solution is to force the evaluation of those thunks. Because we're talking about space, the first thing I did was actually to store your info in a smaller format. This reduces that ugly 10-byte bytestring (+ an overhead of ~16 bytes of structure information) to around 8 bytes. `importRecord` now has to call `toRecord r` to get the right type.

We'll need to evaluate the data when we convert from `ByteString` to `Record`, so let's use the parallel package and define an `NFData` instance via DeepSeq. Now we're ready to go: I modified `main` to use `evalList`, thus forcing the whole list before your function that wants the last element. And we can see the heap profile looks beautiful (and `top` agrees: the program uses very little memory).

Sorry about that other misleading, wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks were hanging around; not sure why my brain glided over that. Though I do stand by the gist: you should incrementally process this information, making all of this answer moot. FYI, the huge bytestring didn't show up in the heap profiles I posted earlier because foreign allocations (which includes `ByteString`) aren't tracked by the heap profiler.
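The code blocks from this answer were lost in the copy. A sketch along these lines shows the idea; the exact field layout and the `toRecord` parser are my reconstruction, and where the original used `evalList`/`rdeepseq` from the parallel package, this version uses `force` from the deepseq package to keep the example self-contained:

```haskell
import Control.DeepSeq (NFData (..), force)
import Data.Word (Word16, Word8)
import qualified Data.ByteString.Lazy.Char8 as BS

-- A date like "2009-10-01" packed into strict unboxed fields (~8 bytes)
-- instead of a 10-byte ByteString plus structure overhead.
data Record = Record {-# UNPACK #-} !Word16  -- year
                     {-# UNPACK #-} !Word8   -- month
                     {-# UNPACK #-} !Word8   -- day
  deriving Show

-- All fields are strict, so evaluating to WHNF evaluates everything.
instance NFData Record where
  rnf (Record _ _ _) = ()

-- Parse "YYYY-MM-DD"; malformed components fall back to 0.
toRecord :: BS.ByteString -> Record
toRecord s = case BS.split '-' s of
  [y, m, d] -> Record (num y) (num m) (num d)
  _         -> error "malformed date"
  where num x = fromIntegral (maybe 0 fst (BS.readInt x))

main :: IO ()
main = do
  -- 'force' evaluates the whole list of Records up front, so no thunks
  -- referencing the source ByteString survive.
  let rs = force (map toRecord (BS.lines (BS.pack "2009-10-01\n2009-10-02")))
  print (last rs)  -- prints: Record 2009 10 2
```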
There seem to be two questions here:
I don't really know what to say about the first one that TomMD didn't already say. Inside the `loopInput` loop, `jrs` can never be freed, because it's needed as an argument to the recursive call of `loopInput`. (You know that `return ()` doesn't do anything when (***) is present, right?)

As for the second question, I think you are right that the input ByteString isn't being garbage-collected. The reason is that you never evaluate the elements of your list `jrs` besides the last one, so they still contain references to the original ByteString (even though they are of the form `BS.copy ...`). I would think that replacing `show $ last jrs` with `show jrs` would reduce your memory usage; does it? Alternatively, you could try a stricter map: replace the `map` in `importRecords` with `map'` and see whether that reduces your memory usage.
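The definition of `map'` was dropped from this copy of the answer. Presumably the stricter map it has in mind is something like the following sketch, which forces each element to weak head normal form as the list spine is walked (e.g. by `last`), so the element thunks, and with them the references into the source ByteString, cannot pile up:

```haskell
-- A head-strict map: 'seq' forces each element when its cons cell is
-- demanded, rather than leaving an unevaluated thunk in the list.
map' :: (a -> b) -> [a] -> [b]
map' _ []     = []
map' f (x:xs) = let y = f x in y `seq` y : map' f xs

main :: IO ()
main = print (map' (* 2) [1 .. 5 :: Int])  -- prints: [2,4,6,8,10]
```

Note that `map'` is still lazy in the list spine, so it composes with `last` without materializing anything extra; it only changes when the elements themselves get evaluated.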