Haskell: batched file processing does not improve the space profile
I have a simple algorithm to implement: compare each line with each other line. Each line contains one number, and the comparison function is the distance. The sum of all distances is the final result.
This can be implemented as simply as follows:
sumOfDistancesOnSmallFile :: FilePath -> IO Integer
sumOfDistancesOnSmallFile path = withFile path ReadMode $ \h -> do
    -- Read and parse every line up front, then sum all pairwise distances.
    -- (distancesManyToMany already returns 0 for an empty list.)
    distances <- liftM (map textRead) (hListToEOF Text.hGetLine h)
    return $ distancesManyToMany distances
hListToEOF :: (Handle -> IO a) -> Handle -> IO [a]
hListToEOF func h = do
    element <- func h
    atEOF   <- hIsEOF h
    -- The recursive call runs before the list is returned, so the whole
    -- file is read eagerly; nothing about (:) makes this lazy.
    rest    <- if atEOF
                   then return []
                   else hListToEOF func h
    return (element : rest)
distancesManyToMany :: [Integer] -> Integer
distancesManyToMany (x:xs) = distancesOneToMany x xs + distancesManyToMany xs
distancesManyToMany _      = 0

distancesOneToMany :: Integer -> [Integer] -> Integer
distancesOneToMany one many = sum $ map (distance one) many

distance :: Integer -> Integer -> Integer
distance a b = a - b
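These snippets rely on a textRead helper and a few imports that the post never shows. The following reconstruction is a guess on my part, based on the Data.Text.IO / Data.Text.Read modules listed in EDIT 3:

import System.IO (Handle, IOMode (ReadMode), hIsEOF, withFile)
import Control.Monad (liftM)
import qualified Data.Text as T
import qualified Data.Text.IO as Text
import Data.Text.Read (decimal)

-- Hypothetical parser for a line holding one non-negative integer.
textRead :: T.Text -> Integer
textRead t = case decimal t of
    Right (n, _) -> n
    Left err     -> error err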
To get reasonably big data on each line, I've used the following file generator (it emits the squares of the Fibonacci numbers, so the lines quickly grow to hundreds of digits):
createTestFile :: Int -> FilePath -> IO ()
createTestFile n path = writeFile path $ unlines $ map show $ take n $ infiniteList 0 1
  where
    infiniteList :: Integer -> Integer -> [Integer]
    infiniteList i j = (i+j) * (i+j) : infiniteList j (i+j)
A 2,000-line file of 840 KB takes 1.92 seconds and 1.5 GB of total allocation, with a maximum memory usage of around 1.5 MB.
A 6,000-line file of 7.5 MB takes 22 seconds and 34 GB of total allocation, with a maximum memory usage of around 15 MB.
Unfortunately, my data will have millions of lines. I initially tried to improve speed (which I asked about in two previous posts, on MapReduce combined with Iteratee IO), but the real limiting problem is space.
Intermediate thought: this could be overcome by re-reading the complete file for each number to be compared. That costs a lot of additional time, because the file has to be opened and parsed once for every line that is compared against the remainder of the file, and the number of memory allocations becomes quadratic. So it is not really useful as a final solution.
The next step:
Batched execution was my first step towards that goal. I take a few thousand lines into memory, apply the many-to-many algorithm to that batch, and then iterate through the remainder of the file. On each iteration step, only one successive line needs to be read and parsed; it can then be compared with all items in the in-memory batch.
By choosing a big enough batch size, the file does not have to be re-read often. My implementation is as follows:
sumOfDistancesOnBigFileUsingBatches :: FilePath -> Int -> Int -> IO Integer
sumOfDistancesOnBigFileUsingBatches path batchSize offset = do
    (firstResult, maybeRecurse) <- singleResultBatch path batchSize offset
    recursiveResult <- case maybeRecurse of
        Nothing        -> return 0
        Just newOffset -> sumOfDistancesOnBigFileUsingBatches path batchSize newOffset
    return $ firstResult + recursiveResult
singleResultBatch :: FilePath -> Int -> Int -> IO (Integer, Maybe Int)
singleResultBatch path batchSize offset = withFile path ReadMode $ \h -> do
    distances <- readDistances h
    -- Skip the lines already handled, then split off one batch.
    let (batch, subSet) = splitAt batchSize $ drop offset distances
    let batchInner = distancesManyToMany batch
    let recursionTerminated = null subSet
    let (batchToSubSet, newOffset) =
            if recursionTerminated
                then (0, Nothing)
                else (distancesSetToSet batch subSet, Just (offset + batchSize))
    return (batchInner + batchToSubSet, newOffset)
  where
    readDistances h = liftM (map textRead) (hListToEOF Text.hGetLine h)

distancesSetToSet :: [Integer] -> [Integer] -> Integer
distancesSetToSet xs ys = sum $ map (\one -> distancesOneToMany one xs) ys
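For reference, a driver along the following lines reproduces this kind of measurement; the file name and the exact flags are my own assumptions, not taken from the post:

main :: IO ()
main = do
    createTestFile 2000 "distances2k.txt"  -- hypothetical file name
    total <- sumOfDistancesOnBigFileUsingBatches "distances2k.txt" 500 0
    print total

Built with profiling enabled (e.g. ghc -O2 -prof -fprof-auto -rtsopts on a recent GHC; older versions spelled the flag -auto-all) and run with +RTS -p -hr, this yields a .prof report and a retainer profile like the ones shown below.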
On the 2k-line file with a batch size of 500, it finishes in 2.16 seconds, with 2.2 GB of allocation and around 6 MB of required space. That is four times the space of the simplest version! It might be a coincidence, but four batches are also used...
What surprised me is that all of the required space is consumed up front; afterwards, the usage only decreases. This becomes a problem with a 50k-line file (500 MB), because then it runs out of memory.
My question is: why does the batched solution consume more memory? It seems to keep the whole file in memory for each batch, even though my intention was to keep only a single batch in memory.
EDIT:
I removed the details of the 6k-line file with 500-line batches (I had taken the wrong profile file).
In addition, here is the space profile generated using the 2k-line file and 500-line batches:
[space profile image]
EDIT 2:
Retainer profiling resulted in:
total time  =        2.24 secs   (112 ticks @ 20 ms)
total alloc = 2,126,803,896 bytes  (excludes profiling overheads)

COST CENTRE         MODULE                      %time %alloc
textRead            MapReduceTestStrictStrings   47.3   44.4
distance            MapReduceTestStrictStrings   25.9   25.3
distancesOneToMany  MapReduceTestStrictStrings   18.8   29.5
singleResultBatch   MapReduceTestStrictStrings    4.5    0.0
readTextDevice      Data.Text.IO.Internal         2.7    0.0

                                                                              individual     inherited
COST CENTRE                           MODULE                      no. entries %time %alloc  %time %alloc
MAIN                                  MAIN                           1      0   0.0    0.0  100.0  100.0
 main                                 Main                        1604      2   0.0    0.0  100.0  100.0
  sumOfDistancesOnBigFileUsingBatches MapReduceTestStrictStrings  1605      4   0.0    0.0  100.0  100.0
   singleResultBatch                  MapReduceTestStrictStrings  1606     20   4.5    0.0  100.0  100.0
    distancesSetToSet                 MapReduceTestStrictStrings  1615      3   0.0    0.0   34.8   43.3
     distancesOneToMany               MapReduceTestStrictStrings  1616   3000  14.3   23.2   34.8   43.2
      distance                        MapReduceTestStrictStrings  1617 1500000 20.5   20.0   20.5   20.0
    textRead                          MapReduceTestStrictStrings  1614   5000  47.3   44.4   47.3   44.4
    distancesManyToMany               MapReduceTestStrictStrings  1611   2004   0.0    0.0    9.8   11.7
     distancesOneToMany               MapReduceTestStrictStrings  1612   2000   4.5    6.3    9.8   11.6
      distance                        MapReduceTestStrictStrings  1613 499000   5.4    5.3    5.4    5.3
    hListToEOF                        MapReduceTestStrictStrings  1609  23996   0.9    0.6    3.6    0.6
     readTextDevice                   Data.Text.IO.Internal       1610   1660   2.7    0.0    2.7    0.0
CAF:main4                             Main                        1591      1   0.0    0.0    0.0    0.0
CAF:main5                             Main                        1590      1   0.0    0.0    0.0    0.0
 main                                 Main                        1608      0   0.0    0.0    0.0    0.0
CAF                                   GHC.Num                     1580      1   0.0    0.0    0.0    0.0
CAF                                   GHC.IO.Handle.FD            1526      2   0.0    0.0    0.0    0.0
CAF                                   GHC.IO.FD                   1510      2   0.0    0.0    0.0    0.0
CAF                                   System.Event.Thread         1508      3   0.0    0.0    0.0    0.0
CAF                                   GHC.IO.Encoding.Iconv       1487      2   0.0    0.0    0.0    0.0
CAF                                   System.Event.Internal       1486      2   0.0    0.0    0.0    0.0
CAF                                   System.Event.Unique         1483      1   0.0    0.0    0.0    0.0
CAF                                   GHC.Conc.Signal             1480      1   0.0    0.0    0.0    0.0
CAF                                   Data.Text.Internal           813      1   0.0    0.0    0.0    0.0
CAF                                   Data.Text.Array              811      1   0.0    0.0    0.0    0.0
Retainer sets created during profiling:
SET 2 = {<MAIN.SYSTEM>}
SET 3 = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 15 = {<GHC.IO.FD.CAF>}
SET 17 = {<System.Event.Thread.CAF>}
SET 18 = {<>}
SET 44 = {<GHC.IO.Handle.FD.CAF>}
SET 47 = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
SET 56 = {<GHC.Conc.Signal.CAF>}
SET 57 = {<>, <MAIN.SYSTEM>}
SET 66 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 67 = {<System.Event.Thread.CAF>, <>, <MAIN.SYSTEM>}
SET 72 = {<GHC.Conc.Sync.CAF>, <MAIN.SYSTEM>}
SET 76 = {<MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 81 = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 83 = {<GHC.IO.Encoding.Iconv.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 86 = {<GHC.Conc.Signal.CAF>, <>}
SET 95 = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 96 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 100 = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 102 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 103 = {<MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 136 = {<GHC.IO.Handle.FD.CAF>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 143 = {<GHC.Conc.Sync.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
SET 144 = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 145 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 146 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 147 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 148 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
And the following .hp graph:
[heap profile image]
EDIT 3:
The previous code all used the strict modules:
Data.Text.IO
Data.Text
Data.Text.Read
When I use their lazy counterparts, the total time / memory / space usage doesn't really change: 2.62 seconds, 2.25 GB of allocation, and 5.5 MB of space.
The accepted solution:
The lazy versions did not work, because hListToEOF forces a full read of the file (I had expected the (:) constructor to work lazily).
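To see why: in IO, the recursive call in hListToEOF must finish before the list is returned, so the laziness of (:) never comes into play. A hand-rolled reader can only produce its list lazily by explicitly deferring the tail, e.g. with unsafeInterleaveIO, which is roughly what lazy readFile does internally. A sketch of mine, not from the original thread:

import System.IO.Unsafe (unsafeInterleaveIO)

-- Lazy variant: the recursive read is deferred until the tail is demanded.
-- Careful: combined with withFile, the handle may be closed before the
-- list has been fully consumed.
hListToEOFLazy :: (Handle -> IO a) -> Handle -> IO [a]
hListToEOFLazy func h = do
    element <- func h
    atEOF   <- hIsEOF h
    rest    <- if atEOF
                   then return []
                   else unsafeInterleaveIO (hListToEOFLazy func h)
    return (element : rest)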
The solution is to use the following imports:
import qualified Data.ByteString.Char8 as Str
import qualified Data.Text.Lazy.IO as TextIO
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Read
and, in the singleResultBatch function, the following modification:

readDistances = liftM (map textRead . T.lines) (TextIO.readFile path)
Then both the speed (2.72 s) and the memory allocations (2.3 GB) stay roughly the same, which is expected.
The heap profile (space usage) does improve (1.8 MB instead of 5.5 MB), as the resulting .hp graph shows.
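Pulling the pieces together, singleResultBatch after this change might look as follows; this is my own reconstruction, assuming textRead is adapted to lazy Text and the rest of the logic is unchanged:

singleResultBatch :: FilePath -> Int -> Int -> IO (Integer, Maybe Int)
singleResultBatch path batchSize offset = do
    distances <- readDistances
    let (batch, subSet) = splitAt batchSize $ drop offset distances
        batchInner = distancesManyToMany batch
        (batchToSubSet, newOffset) =
            if null subSet
                then (0, Nothing)
                else (distancesSetToSet batch subSet, Just (offset + batchSize))
    return (batchInner + batchToSubSet, newOffset)
  where
    -- TextIO.readFile is lazy: lines are only read and parsed as they are
    -- demanded, so only the current batch plus the one line being compared
    -- needs to be live at any time.
    readDistances = liftM (map textRead . T.lines) (TextIO.readFile path)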
2 Answers
You need to process the data incrementally. Currently, hListToEOF reads all the data in one go, which you then slowly process (hence the initial memory spike as everything is read in, then a slow reduction as the list is deallocated).

Instead of doing your own IO via hListToEOF, read/stream the file lazily (e.g. with readFile from Data.Text.Lazy.IO) and map your processing functions over it.
I think a big part of your problem is that you're holding on to lists, which are fairly inefficient space-wise if they need to be retained. Try switching to a more compact storage mechanism, such as Vector. This is a fairly direct translation of some of your functions, with probably a lot of room for optimization (see the sketch below). On my system, it executes on a 10,000-line file of test data in about 14 s, with a maximum memory usage of 125 MB.
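The answer's actual code block did not survive the copy above; what follows is my own sketch of the kind of translation it describes, using boxed Data.Vector and keeping the original Integer element type:

{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector as V

-- Load the whole file into a vector: one contiguous array instead of a
-- cons cell per element.
readDistancesV :: FilePath -> IO (V.Vector Integer)
readDistancesV path = (V.fromList . map read . lines) `fmap` readFile path

-- Pairwise distance sum by index, with a strict accumulator; V.drop is an
-- O(1) slice, so no intermediate lists are built.
distancesManyToManyV :: V.Vector Integer -> Integer
distancesManyToManyV v = go 0 0
  where
    n = V.length v
    go !acc i
      | i >= n    = acc
      | otherwise = go (acc + distancesOneToManyV (v V.! i) (V.drop (i + 1) v))
                       (i + 1)

distancesOneToManyV :: Integer -> V.Vector Integer -> Integer
distancesOneToManyV one = V.sum . V.map (one -)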