Haskell 批处理文件处理不会改善空间配置

发布于 2024-11-06 11:03:06 字数 13592 浏览 5 评论 0原文

我有一个简单的算法要实现：将每一行与其他行进行比较。每一行包含一个数字，比较函数是距离。所有距离的总和就是最终结果。

这可以简单地实现如下：

sumOfDistancesOnSmallFile :: FilePath -> IO Integer
sumOfDistancesOnSmallFile path = withFile path ReadMode $ \h->do
                          distances <- liftM ( (map textRead) ) $ hListToEOF Text.hGetLine h
                          let subSet = drop offset distances
                          let emptySubSet = null subSet
                          return $ if (emptySubSet)
                                           then (0)
                                           else (distancesManyToMany subSet)

hListToEOF :: (Handle -> IO a) -> Handle -> IO [a]
hListToEOF func h = do
    element <- func h
    atEOF <- hIsEOF h
    rest <- case(atEOF) of
        True -> return []
        False -> hListToEOF func h
    return $ element:rest

distancesManyToMany :: [Integer]->Integer
distancesManyToMany (x:xs) = distancesOneToMany x xs + (distancesManyToMany xs)
distancesManyToMany _ = 0

distancesOneToMany :: Integer -> [Integer] -> Integer
distancesOneToMany one many = sum $ map (distance one) many

distance :: Integer -> Integer -> Integer
distance a b = (a-b)

为了在每行上获得合理的大数据，我使用了以下文件生成器：

createTestFile :: Int -> FilePath -> IO ()
createTestFile n path = writeFile path $ unlines $ map show $ take n $ infiniteList 0 1
    where infiniteList :: Integer->Integer-> [Integer]
          infiniteList i j = (i+j) * (i+j) : infiniteList j (i+j)

840kb 的 2000 行文件将需要 1.92 秒和 1.5Gb 分配，最大使用量约为 1.5 MB。

7.5mb 的 6k 行文件将需要 22 秒，分配 34Gb，最大内存使用量约为 15Mb

不幸的是我的数据将有数百万行。我最初尝试提高速度（关于这一点，我在之前的两篇文章中询问过MapReduce 与 Iteratee IO），但实际的限制问题是空间。

中间想法：可以通过阅读每个数字的完整文件进行比较来克服这个问题。这确实需要花费大量额外时间，因为需要打开文件并解析每一行，以便与文件的其余部分进行比较。内存分配的数量也将成二次方。因此，这作为最终解决方案并不是很有用

最后一步：这是我实现目标的第一步：批量执行。我想将几 k 行存入内存。对内存中的数据应用 ManyToMany 算法。然后，迭代文件的其余部分。在每个迭代步骤中，仅需要读取和解析连续的一行，然后可以将其与内存批次中的所有项目进行比较。

通过选择足够大的批量大小，文件不必经常重新读取。我的实现如下：

sumOfDistancesOnBigFileUsingBatches :: FilePath -> Int -> Int -> IO Integer
sumOfDistancesOnBigFileUsingBatches path batchSize offset = do
                      (firstResult, maybeRecurse) <- singleResultBatch path batchSize offset
                      recursiveResult <- case maybeRecurse of
                                             Nothing -> return 0
                                             Just newOffset -> sumOfDistancesOnBigFileUsingBatches path batchSize newOffset
                      return $ firstResult + recursiveResult

singleResultBatch :: FilePath -> Int -> Int -> IO(Integer, Maybe Int)
singleResultBatch path batchSize offset = withFile path ReadMode $ \h->do
                          distances <- readDistances h
                          let (batch, subSet) = splitAt batchSize $ drop offset distances
                          let batchInner = distancesManyToMany batch
                          let recursionTerminated = null subSet
                          let (batchToSubSet, newOffset) = if (recursionTerminated)
                                              then (0, Nothing)
                                              else (distancesSetToSet batch subSet, Just (offset+batchSize))
                          return (batchInner+batchToSubSet, newOffset)
                          where
                            readDistances h = liftM ( (map textRead) ) $ hListToEOF Text.hGetLine h


distancesSetToSet :: [Integer] -> [Integer] -> Integer
distancesSetToSet xs ys = sum $ map (\one->distancesOneToMany one xs) ys

在 2k 行文件上，批量大小为 500，完成时间为 2.16 秒、2.2Gb 分配和大约 6Mb 所需空间。这是最简单版本空间的 4 倍！也许是巧合，但也使用了 4 个批次……

令我惊讶的是，最初所需的空间全部被消耗掉，后来所需的空间只会减少。这对于 50k 行文件 (500MB) 来说是一个问题，因为这样它就会耗尽内存。

我的问题是：为什么批量解决方案会消耗更多内存？它似乎将每个批次的整个文件保留在内存中，即使它应该（至少这是我的意图）只在内存中保留一个批次。

编辑：我删除了 6k 行文件和 500 行批次的详细信息（我使用了错误的配置文件）

此外，这里是使用 2k 行文件和 500 行批次生成的空间配置文件： 2k 行文件上批处理解决方案的空间配置文件 (840KB)

EDIT2：使用固定器进行分析的结果是：

total time  =        2.24 secs   (112 ticks @ 20 ms)
total alloc = 2,126,803,896 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

textRead                       MapReduceTestStrictStrings  47.3   44.4
distance                       MapReduceTestStrictStrings  25.9   25.3
distancesOneToMany             MapReduceTestStrictStrings  18.8   29.5
singleResultBatch              MapReduceTestStrictStrings   4.5    0.0
readTextDevice                 Data.Text.IO.Internal   2.7    0.0


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
 main                    Main                                                1604           2   0.0    0.0   100.0  100.0
  sumOfDistancesOnBigFileUsingBatches MapReduceTestStrictStrings                          1605           4   0.0    0.0   100.0  100.0
   singleResultBatch     MapReduceTestStrictStrings                          1606          20   4.5    0.0   100.0  100.0
    distancesSetToSet    MapReduceTestStrictStrings                          1615           3   0.0    0.0    34.8   43.3
     distancesOneToMany  MapReduceTestStrictStrings                          1616        3000  14.3   23.2    34.8   43.2
      distance           MapReduceTestStrictStrings                          1617     1500000  20.5   20.0    20.5   20.0
    textRead             MapReduceTestStrictStrings                          1614        5000  47.3   44.4    47.3   44.4
    distancesManyToMany  MapReduceTestStrictStrings                          1611        2004   0.0    0.0     9.8   11.7
     distancesOneToMany  MapReduceTestStrictStrings                          1612        2000   4.5    6.3     9.8   11.6
      distance           MapReduceTestStrictStrings                          1613      499000   5.4    5.3     5.4    5.3
    hListToEOF           MapReduceTestStrictStrings                          1609       23996   0.9    0.6     3.6    0.6
     readTextDevice      Data.Text.IO.Internal                               1610        1660   2.7    0.0     2.7    0.0
 CAF:main4               Main                                                1591           1   0.0    0.0     0.0    0.0
 CAF:main5               Main                                                1590           1   0.0    0.0     0.0    0.0
  main                   Main                                                1608           0   0.0    0.0     0.0    0.0
 CAF                     GHC.Num                                             1580           1   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Handle.FD                                    1526           2   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.FD                                           1510           2   0.0    0.0     0.0    0.0
 CAF                     System.Event.Thread                                 1508           3   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Encoding.Iconv                               1487           2   0.0    0.0     0.0    0.0
 CAF                     System.Event.Internal                               1486           2   0.0    0.0     0.0    0.0
 CAF                     System.Event.Unique                                 1483           1   0.0    0.0     0.0    0.0
 CAF                     GHC.Conc.Signal                                     1480           1   0.0    0.0     0.0    0.0
 CAF                     Data.Text.Internal                                   813           1   0.0    0.0     0.0    0.0
 CAF                     Data.Text.Array                                      811           1   0.0    0.0     0.0    0.0

Retainer sets created during profiling:
SET 2 = {<MAIN.SYSTEM>}
SET 3 = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 15 = {<GHC.IO.FD.CAF>}
SET 17 = {<System.Event.Thread.CAF>}
SET 18 = {<>}
SET 44 = {<GHC.IO.Handle.FD.CAF>}
SET 47 = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
SET 56 = {<GHC.Conc.Signal.CAF>}
SET 57 = {<>, <MAIN.SYSTEM>}
SET 66 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 67 = {<System.Event.Thread.CAF>, <>, <MAIN.SYSTEM>}
SET 72 = {<GHC.Conc.Sync.CAF>, <MAIN.SYSTEM>}
SET 76 = {<MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 81 = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 83 = {<GHC.IO.Encoding.Iconv.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 86 = {<GHC.Conc.Signal.CAF>, <>}
SET 95 = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 96 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 100 = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 102 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 103 = {<MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 136 = {<GHC.IO.Handle.FD.CAF>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 143 = {<GHC.Conc.Sync.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
SET 144 = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 145 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 146 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 147 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 148 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}

以及以下 .hp 图像： 2k 行文件和 500 行批次的保留堆配置文件

编辑 3： 以前的代码都使用了这些包：

Data.Text.IO
Data.Text
Data.Text.Read

当我使用它们的惰性版本时，总时间/内存/空间使用量并没有真正改变：2.62秒，2.25Gb分配和5.5MB空间

接受的解决方案：惰性版本不起作用，因为 hListToEOF 强制读取完整文件（我期望 : 构造函数惰性工作）。

解决方案是使用以下导入：

import qualified Data.ByteString.Char8 as Str
import qualified Data.Text.Lazy.IO as TextIO
import qualified Data.Text.Lazy as T 
import Data.Text.Lazy.Read

并在 singleResultBatch 函数中进行以下修改：

                            readDistances = liftM  ( (map textRead . T.lines)) $ TextIO.readFile path

然后速度（2.72s）和内存分配（2.3GB）都不会改变，这是预期的。

堆配置文件（空间使用情况）确实有所改善（1.8MB 而不是 5.5MB），如下所示：使用 Text.Lazy 包时的堆配置文件

原文

I have a simple algorithm to implement: compare each line with each other line. Each line contains one number, and the comparison function is the distance. The sum of all distances is the final result.

This can be implemented as simply as follows:

sumOfDistancesOnSmallFile :: FilePath -> IO Integer
sumOfDistancesOnSmallFile path = withFile path ReadMode $ \h->do
                          distances <- liftM ( (map textRead) ) $ hListToEOF Text.hGetLine h
                          let subSet = drop offset distances
                          let emptySubSet = null subSet
                          return $ if (emptySubSet)
                                           then (0)
                                           else (distancesManyToMany subSet)

hListToEOF :: (Handle -> IO a) -> Handle -> IO [a]
hListToEOF func h = do
    element <- func h
    atEOF <- hIsEOF h
    rest <- case(atEOF) of
        True -> return []
        False -> hListToEOF func h
    return $ element:rest

distancesManyToMany :: [Integer]->Integer
distancesManyToMany (x:xs) = distancesOneToMany x xs + (distancesManyToMany xs)
distancesManyToMany _ = 0

distancesOneToMany :: Integer -> [Integer] -> Integer
distancesOneToMany one many = sum $ map (distance one) many

distance :: Integer -> Integer -> Integer
distance a b = (a-b)

To get reasonable big data on each line, i've used the following file generator:

createTestFile :: Int -> FilePath -> IO ()
createTestFile n path = writeFile path $ unlines $ map show $ take n $ infiniteList 0 1
    where infiniteList :: Integer->Integer-> [Integer]
          infiniteList i j = (i+j) * (i+j) : infiniteList j (i+j)

A 2000 line file, of 840kb will take 1.92 seconds and 1.5Gb allocations, with a maximum usage of around 1.5Mb.

A 6k line file, of 7.5mb will take 22 seconds, 34Gb allocations, with a maximum memory usage of around 15Mb

Unfortunately my data will be millions of lines. I initially attempted to improve speed (about which I asked in 2 previous posts about MapReduce combined with Iteratee IO), but the actual limiting problem is space.

Intermediate thought:This could be overcome by reading the complete file for each number to compare. This does take a lot of additional time, because the file needs to be opened and parsed for each line that is to be compared with the remainder of the file. Also the number of memory allocations will become quadratic. So that's not really useful as a final solution

The final step:
That was my first step towards my goal: batched execution. I would like to take a few k lines into memory. Apply the ManyToMany algorithm on those in memory. Then, iterate through the remainder of the file. On each iteration step, only one successive line needs to be read and parsed, which then can be compared to all items in the memory batch.

By choosing a batch size big enough the file does not have to be re-read often. My implementation is as follows:

sumOfDistancesOnBigFileUsingBatches :: FilePath -> Int -> Int -> IO Integer
sumOfDistancesOnBigFileUsingBatches path batchSize offset = do
                      (firstResult, maybeRecurse) <- singleResultBatch path batchSize offset
                      recursiveResult <- case maybeRecurse of
                                             Nothing -> return 0
                                             Just newOffset -> sumOfDistancesOnBigFileUsingBatches path batchSize newOffset
                      return $ firstResult + recursiveResult

singleResultBatch :: FilePath -> Int -> Int -> IO(Integer, Maybe Int)
singleResultBatch path batchSize offset = withFile path ReadMode $ \h->do
                          distances <- readDistances h
                          let (batch, subSet) = splitAt batchSize $ drop offset distances
                          let batchInner = distancesManyToMany batch
                          let recursionTerminated = null subSet
                          let (batchToSubSet, newOffset) = if (recursionTerminated)
                                              then (0, Nothing)
                                              else (distancesSetToSet batch subSet, Just (offset+batchSize))
                          return (batchInner+batchToSubSet, newOffset)
                          where
                            readDistances h = liftM ( (map textRead) ) $ hListToEOF Text.hGetLine h


distancesSetToSet :: [Integer] -> [Integer] -> Integer
distancesSetToSet xs ys = sum $ map (\one->distancesOneToMany one xs) ys

On the 2k line file, with a batch size of 500 it finished with 2.16secs, 2.2Gb allocations and around 6Mb required space. That is 4 times the space of the simplest version! It might be coincidence, but there are also 4 batches utilized...

What surprised me, is that all the required space is consumed initially, later on the required space only decreases. This becomes a problem with a 50k line file (500MB), because then it runs out of memory.

My question is: why does the batches solution consume more memory? It seems to keep the whole file in memory for each batch, even though it should (at least that's my intention) only keep one single batch in memory.

EDIT:
I removed the details of the 6k line file and 500line batches (I took a wrong profile file)

And as addition, here is the space profile generated using the 2k line file and 500line batches:
space profile of batched solution on 2k line file (840KB)

EDIT2:
Profiling with retainer resulted in:

total time  =        2.24 secs   (112 ticks @ 20 ms)
total alloc = 2,126,803,896 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

textRead                       MapReduceTestStrictStrings  47.3   44.4
distance                       MapReduceTestStrictStrings  25.9   25.3
distancesOneToMany             MapReduceTestStrictStrings  18.8   29.5
singleResultBatch              MapReduceTestStrictStrings   4.5    0.0
readTextDevice                 Data.Text.IO.Internal   2.7    0.0


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
 main                    Main                                                1604           2   0.0    0.0   100.0  100.0
  sumOfDistancesOnBigFileUsingBatches MapReduceTestStrictStrings                          1605           4   0.0    0.0   100.0  100.0
   singleResultBatch     MapReduceTestStrictStrings                          1606          20   4.5    0.0   100.0  100.0
    distancesSetToSet    MapReduceTestStrictStrings                          1615           3   0.0    0.0    34.8   43.3
     distancesOneToMany  MapReduceTestStrictStrings                          1616        3000  14.3   23.2    34.8   43.2
      distance           MapReduceTestStrictStrings                          1617     1500000  20.5   20.0    20.5   20.0
    textRead             MapReduceTestStrictStrings                          1614        5000  47.3   44.4    47.3   44.4
    distancesManyToMany  MapReduceTestStrictStrings                          1611        2004   0.0    0.0     9.8   11.7
     distancesOneToMany  MapReduceTestStrictStrings                          1612        2000   4.5    6.3     9.8   11.6
      distance           MapReduceTestStrictStrings                          1613      499000   5.4    5.3     5.4    5.3
    hListToEOF           MapReduceTestStrictStrings                          1609       23996   0.9    0.6     3.6    0.6
     readTextDevice      Data.Text.IO.Internal                               1610        1660   2.7    0.0     2.7    0.0
 CAF:main4               Main                                                1591           1   0.0    0.0     0.0    0.0
 CAF:main5               Main                                                1590           1   0.0    0.0     0.0    0.0
  main                   Main                                                1608           0   0.0    0.0     0.0    0.0
 CAF                     GHC.Num                                             1580           1   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Handle.FD                                    1526           2   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.FD                                           1510           2   0.0    0.0     0.0    0.0
 CAF                     System.Event.Thread                                 1508           3   0.0    0.0     0.0    0.0
 CAF                     GHC.IO.Encoding.Iconv                               1487           2   0.0    0.0     0.0    0.0
 CAF                     System.Event.Internal                               1486           2   0.0    0.0     0.0    0.0
 CAF                     System.Event.Unique                                 1483           1   0.0    0.0     0.0    0.0
 CAF                     GHC.Conc.Signal                                     1480           1   0.0    0.0     0.0    0.0
 CAF                     Data.Text.Internal                                   813           1   0.0    0.0     0.0    0.0
 CAF                     Data.Text.Array                                      811           1   0.0    0.0     0.0    0.0

Retainer sets created during profiling:
SET 2 = {<MAIN.SYSTEM>}
SET 3 = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 15 = {<GHC.IO.FD.CAF>}
SET 17 = {<System.Event.Thread.CAF>}
SET 18 = {<>}
SET 44 = {<GHC.IO.Handle.FD.CAF>}
SET 47 = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
SET 56 = {<GHC.Conc.Signal.CAF>}
SET 57 = {<>, <MAIN.SYSTEM>}
SET 66 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 67 = {<System.Event.Thread.CAF>, <>, <MAIN.SYSTEM>}
SET 72 = {<GHC.Conc.Sync.CAF>, <MAIN.SYSTEM>}
SET 76 = {<MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 81 = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 83 = {<GHC.IO.Encoding.Iconv.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 86 = {<GHC.Conc.Signal.CAF>, <>}
SET 95 = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 96 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 100 = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 102 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 103 = {<MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 136 = {<GHC.IO.Handle.FD.CAF>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 143 = {<GHC.Conc.Sync.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
SET 144 = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 145 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 146 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 147 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
SET 148 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}

And the following .hp image:
Retainer heap profile for 2k line file and 500line batches

EDIT 3:
The previous code all used the packages:

Data.Text.IO
Data.Text
Data.Text.Read

When i use the lazy versions of them, the total time / memory / space usages doesn't really change: 2.62 secs, 2.25Gb allocations and 5.5MB space

The accepted solution:
The lazy versions did not work because the hListToEOF forced a full file read (I expected the : constructor to work lazily).

The solution is to use he following imports:

import qualified Data.ByteString.Char8 as Str
import qualified Data.Text.Lazy.IO as TextIO
import qualified Data.Text.Lazy as T 
import Data.Text.Lazy.Read

and in the singleResultBatch function the following modification:

                            readDistances = liftM  ( (map textRead . T.lines)) $ TextIO.readFile path

Then the both the speed (2.72s) and the memory allocations (2.3GB) do not change, which is expected.

The heap profile (space usage) does improve (1.8MB instead of 5.5MB), as visible in:
heap profile when Text.Lazy packages are used

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

烟柳画桥 2024-11-13 11:03:06

您需要增量处理数据。目前，hListToEOF 一次性读取所有数据，然后慢慢处理（因此，在读入所有内容时会出现初始内存峰值，然后在释放列表时缓慢减少）。

不要通过 hListToEOF 执行自己的 IO，而是延迟读取/流式传输文件（例如使用 Text.Lazy 库中的 readFile）并将处理函数映射到它们之上。

回复收藏 0 原文

蝶舞 2024-11-13 11:03:06

我认为你的问题的很大一部分是你保留了列表，如果需要保留它们，那么在空间方面效率相当低。尝试切换到更紧凑的存储机制，例如Vector。这是对你的一些函数的相当直接的翻译，可能有很大的优化空间：

sumOfDistancesOnSmallFile :: FilePath -> IO IntegersumOfDistancesOnSmallFile path = withFile path ReadMode $ \h->do
                          distances <- liftM ( V.fromList . (map textRead) ) $ hListToEOF Text.hGetLine h
                          let subSet = distances
                          let emptySubSet = V.null subSet
                          return $ if (emptySubSet)
                                           then (0)
                                           else (distancesManyToMany subSet)


distancesManyToMany :: V.Vector Integer -> Integer
distancesManyToMany mp0 = fst $ V.foldl' (\(!acc,mp) _ -> let (el,mp1) = (V.head mp, V.tail mp) in (acc+el,mp1)) (0,mp0) mp0

distancesOneToMany :: Integer -> V.Vector Integer -> Integer
distancesOneToMany one many = V.sum $ V.map (distance one) many

在我的系统上，这将在大约 14 秒内执行一个 10000 行测试数据文件，最大内存使用量为 125MB。

I think a big part of your problem is that you're holding on to lists, which are fairly inefficient space-wise if they need to be retained. Try switching to a more compact storage mechanism, such as Vector. This is a fairly direct translation of some of your functions, with probably a lot of space for optimizations:

sumOfDistancesOnSmallFile :: FilePath -> IO IntegersumOfDistancesOnSmallFile path = withFile path ReadMode $ \h->do
                          distances <- liftM ( V.fromList . (map textRead) ) $ hListToEOF Text.hGetLine h
                          let subSet = distances
                          let emptySubSet = V.null subSet
                          return $ if (emptySubSet)
                                           then (0)
                                           else (distancesManyToMany subSet)


distancesManyToMany :: V.Vector Integer -> Integer
distancesManyToMany mp0 = fst $ V.foldl' (\(!acc,mp) _ -> let (el,mp1) = (V.head mp, V.tail mp) in (acc+el,mp1)) (0,mp0) mp0

distancesOneToMany :: Integer -> V.Vector Integer -> Integer
distancesOneToMany one many = V.sum $ V.map (distance one) many

On my system, this will execute on a 10000 line file of test data in about 14s, with maximum memory usage of 125MB.

回复收藏 0 原文

~没有更多了~