Garbage collection RTS options for GHC

I have a Haskell program which processes a text file and builds a Map (with several million elements). The whole thing can run for 2-3 minutes. I found that tweaking the -H and -A options makes a big difference in running time.

There is documentation about this functionality of the RTS, but it's a hard read for me since I don't know the algorithms and terms from GC theory. I'm looking for a less technical explanation, preferably specific to Haskell/GHC. Are there any references about choosing sensible values for these options?

EDIT:
Here is the code; it builds a trie for a given list of words.

import Data.List (foldl', sort)
import qualified Data.ByteString.Char8 as B   -- import assumed; not shown in the question

-- MyDFA, emptyDFA, fst3, splitPrefix, transStar and insertTransition are
-- defined elsewhere in the asker's code.
buildTrie :: [B.ByteString] -> MyDFA
buildTrie l = fst3 $ foldl' step (emptyDFA, B.empty, 1) $ sort $ map B.reverse l where
    step :: (MyDFA, B.ByteString, Int) -> B.ByteString -> (MyDFA, B.ByteString, Int)
    step (dfa, lastWord, newIndex) newWord = (insertNewStates, newWord, newIndex + B.length newSuffix) where
        -- split newWord against the previously inserted word
        (pref, lastSuffix, newSuffix) = splitPrefix lastWord newWord
        -- the state reached by following the shared prefix
        branchPoint = transStar dfa pref

        -- new state labels for the newSuffix path
        newStates = [newIndex .. newIndex + B.length newSuffix - 1]
        -- insert one transition per character of newSuffix
        insertNewStates = foldl' (flip insertTransition) dfa $ zip3 (branchPoint:init newStates) (B.unpack newSuffix) newStates
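
To experiment with the -H and -A options it helps to have a small driver. The following is only a sketch: the module name Trie and the file words.txt are assumptions, and buildTrie is the function above.

import qualified Data.ByteString.Char8 as B
import Trie (buildTrie)   -- hypothetical module containing the code above

main :: IO ()
main = do
    ws <- B.lines <$> B.readFile "words.txt"
    let trie = buildTrie ws
    -- seq only forces the outermost constructor; a real program would query
    -- the trie, but this is enough to make the fold run.
    trie `seq` putStrLn "trie built"

-- Build with RTS options enabled, then compare settings, for example:
--   ghc -O2 -rtsopts Main.hs
--   ./Main +RTS -s -RTS          -- default settings, with a GC summary
--   ./Main +RTS -s -A32m -RTS    -- larger allocation area
--   ./Main +RTS -s -H256m -RTS   -- suggested overall heap size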

Comments (2)

东北女汉子 2024-09-14 18:01:28

Generally speaking, garbage collection is a space/time tradeoff. Give the GC more space, and it will take less time. There are (many) other factors in play, cache in particular, but the space/time tradeoff is the most important one.

The tradeoff works like this: the program allocates memory until it reaches some limit (decided by the GC's automatic tuning parameters, or explicitly via RTS options). When the limit is reached, the GC traces all the data that the program is currently using and reclaims all the memory used by data that is no longer required. The longer you can delay this process, the more data will have become unreachable ("dead") in the meantime, and the GC never has to trace that data at all. The only way to delay a GC is to make more memory available to allocate into; hence more memory means fewer GCs, which means lower GC overhead. Roughly speaking, GHC's -H option lets you set a lower bound on the amount of memory used by the GC, and so can lower the GC overhead.

GHC uses a generational GC, which is an optimisation to the basic scheme, in which the heap is divided into two or more generations. Objects are allocated into the "young" generation, and the ones that live long enough get promoted into the "old" generation (in a 2-generation setup). The young generation is collected much more frequently than the old generation, the idea being that "most objects die young", so young-generation collections are cheap (they don't trace much data), but they reclaim a lot of memory. Roughly speaking, the -A option sets the size of the young generation - that is, the amount of memory that will be allocated before the young generation is collected.
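
As a rough illustration of what -A controls (assuming a single-threaded run and ignoring the old generation): a program that allocates 10 GB in total triggers about 10 GB / 512 KB ≈ 20,000 young-generation collections at the default setting, but only about 10 GB / 64 MB ≈ 160 of them with -A64m.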

The default for -A is 512k: it's a good idea to keep the young generation smaller than the L2 cache, and performance generally drops if you exceed the L2 cache size. But pulling in the opposite direction is the GC space/time tradeoff: a very large young generation reduces the amount of work the GC has to do, which may outweigh the cache benefits. This doesn't always happen; it depends on the dynamics of the application, which makes it hard for the GC to tune itself automatically. The -H option also increases the young generation size, so it too can have an adverse effect on cache usage.

The bottom line is: play around with the options and see what works. If you have plenty of memory to spare, you might well be able to get a performance increase by using either -A or -H, but not necessarily.
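
One concrete way to compare settings (a minimal sketch, assuming GHC 8.2 or later): run the program with +RTS -s to get a GC summary on stderr, or read the numbers from inside the program via GHC.Stats, which requires running with +RTS -T. The reportGC helper below is hypothetical, not part of any library.

import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Print a few GC figures at the end of a run; the program must be started
-- with "+RTS -T" (and compiled with -rtsopts) so the RTS keeps statistics.
reportGC :: IO ()
reportGC = do
    enabled <- getRTSStatsEnabled
    if not enabled
        then putStrLn "RTS stats not enabled; rerun with +RTS -T"
        else do
            s <- getRTSStats
            putStrLn $ "collections:       " ++ show (gcs s)
            putStrLn $ "major collections: " ++ show (major_gcs s)
            putStrLn $ "allocated bytes:   " ++ show (allocated_bytes s)
            putStrLn $ "max live bytes:    " ++ show (max_live_bytes s)
            putStrLn $ "GC CPU time (ns):  " ++ show (gc_cpu_ns s)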

嗫嚅 2024-09-14 18:01:28

It is probably possible to reproduce the issue with smaller data sets, where it will be easier to see what is going on. In particular, I suggest gaining some familiarity with profiling (see the profiling chapter of the GHC User's Guide).

Then, check whether the memory profiles you see match your expectations (you don't need to know about all the profiling options to get useful graphs). The combination of strict foldl' with a non-strict tuple as the accumulator would be the first thing I'd look at: if the components of the tuple are not forced, that recursion is building up unevaluated expressions.
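
As an illustration of that point (a sketch with a simplified accumulator, not the code from the question): foldl' only forces its accumulator to weak head normal form, so with a plain tuple the components still pile up as thunks; forcing them at every step, for example with bang patterns, keeps the accumulator evaluated. The same idea would apply to the newIndex counter (and, presumably, to the Map inside MyDFA) in the step function above.

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Lazy pair accumulator: foldl' forces the pair constructor at each step,
-- but the sum and the count inside remain chains of unevaluated (+).
sumCountLazy :: [Int] -> (Int, Int)
sumCountLazy = foldl' (\(s, n) x -> (s + x, n + 1)) (0, 0)

-- Bang patterns force both components on every step, so the accumulator
-- stays small instead of growing into a long thunk chain.
sumCountStrict :: [Int] -> (Int, Int)
sumCountStrict = foldl' step (0, 0)
  where
    step (!s, !n) x = (s + x, n + 1)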

By the way, you can build up a good intuition about such things by evaluating your code by hand for really tiny data sets. A few iterations are enough to see whether expressions get evaluated or remain unevaluated in the way that fits your application.
