是否有一个基于对象身份的、线程安全的记忆库？

发布于 2024-11-26 22:35:09 字数 3145 浏览 6 评论 0原文

我知道记忆似乎是堆栈溢出的 haskell 标签上的一个长期话题，但我认为以前没有人问过这个问题。

我知道 Haskell 有几个不同的“现成”记忆库：

memo-combinators 和 memotrie 包，它们利用涉及惰性无限数据结构的漂亮技巧以纯函数方式实现记忆化。（据我了解，前者稍微灵活一些，而后者在简单情况下更容易使用：请参阅这个SO答案供讨论。）
uglymemo包，它在内部使用unsafePerformIO，但仍然提供一个引用透明的接口。在内部使用 unsafePerformIO 会带来更好的性能< /a> 比前两个包。（现成的，它的实现使用基于比较的搜索数据结构，而不是可能稍微高效的哈希函数；但我认为，如果您找到并替换 Cmp 为 Hashable< /code> 和 Data.Map for Data.HashMap 并添加适当的 import，您将获得基于哈希的版本。）

但是，我我不知道有哪个图书馆会根据基于对象标识而不是对象值。这可能很重要，因为有时用作备忘录表键（即，作为正在记忆的函数的输入）的对象类型可能很大——大到需要全面检查该对象以确定您是否以前见过它本身就是一个缓慢的操作。如果您将一次又一次地将记忆函数应用到存储在给定“内存中的位置”的对象，那么速度很慢，而且也是不必要的 1。（例如，如果我们正在记忆一个在具有大量结构共享的大型数据结构上递归调用的函数，则可能会发生这种情况。）如果我们之前已经在该确切对象上计算了记忆函数，我们可以即使不看物体本身就已经知道答案了！

实现这样的记忆库涉及几个微妙的问题，并且正确地实现它需要语言的几个特殊支持。幸运的是，GHC 提供了我们需要的所有特殊功能，并且有一个 Peyton-Jones、Marlow 和 Elliott 的论文基本上为您担心了大部分问题，解释了如何构建可靠的实现。他们没有提供所有细节，但已经很接近了。

我可以看到哪个人可能应该担心但他们不担心的一个细节是线程安全——他们的代码显然根本不是线程安全的。

我的问题是：有谁知道有一个打包库可以执行 Peyton-Jones、Marlow 和 Elliott 论文中讨论的那种记忆功能，填充所有细节（最好也填充适当的线程安全性）？

如果做不到这一点，我想我将不得不自己编写代码：是否有人对此类库的实现者会做的其他微妙之处（超出线程安全性和论文中讨论的内容）有任何想法好记住吗？

更新

按照下面@luqui的建议，这里有一些关于我所面临的确切问题的更多数据。假设有一个类型：

data Node = Node [Node] [Annotation]

这个类型可以用来表示内存中一种简单的有根DAG，其中Node是DAG节点，根只是一个区分的Node ，并且每个节点都用一些 Annotation 进行注释，我认为其内部结构不需要我们关心（但如果重要，请询问，我会更具体。）如果以这种方式使用，请注意，很可能存在显着的内存中节点之间的结构共享——从根到节点的路径可能比节点本身的数量多得多。我从必须与之交互的外部库中获得了这种形式的数据结构；我无法更改数据类型。

我有一个函数，

myTransform : Node -> Node

其细节不需要我们关心（或者至少我这么认为；但如果需要的话我可以更具体）。它将节点映射到节点，检查给定节点的注释及其直接子节点的注释，以得出具有相同子节点但可能不同注释的新 Node。我希望编写一个函数，

recursiveTransform : Node -> Node

其输出与您将得到的数据结构“看起来相同”：

recursiveTransform Node originalChildren annotations = 
   myTransform Node recursivelyTransformedChildren annotations
   where
     recursivelyTransformedChildren = map recursiveTransform originalChildren

除了它以明显的方式使用结构共享，这样它就不会返回指数数据结构，而是返回一个与其输入大小相同的订单。

我很高兴，如果在获得节点之前对它们进行编号，或者我可以更改节点的定义，那么这一切都会更容易。我不能（轻易）做这两件事。

我也对是否存在一个实现我提到的功能的库的一般问题感兴趣，该库与我现在面临的特定具体问题完全无关：我觉得我必须在某些情况下解决此类问题，如果能一劳永逸地杀死巨龙就好了。 SPJ 等人认为值得向 GHC 添加三个功能来支持这种形式的库的存在，这一事实表明该功能确实有用，并且无法在所有案例。（但我仍然对听到在这种特殊情况下也有帮助的解决方法非常感兴趣：长期问题并不像我现在面临的问题那么紧迫:-)）

1 从技术上讲，我并不是指位置内存，因为垃圾收集器有时会稍微移动对象——我真正的意思是“对象标识”。但我们可以认为这与我们对内存中位置的直观想法大致相同。

原文

I know that memoization seems to be a perennial topic here on the haskell tag on stack overflow, but I think this question has not been asked before.

I'm aware of several different 'off the shelf' memoization libraries for Haskell:

The memo-combinators and memotrie packages, which make use of a beautiful trick involving lazy infinite data structures to achieve memoization in a purely functional way. (As I understand it, the former is slightly more flexible, while the latter is easier to use in simple cases: see this SO answer for discussion.)
The uglymemo package, which uses unsafePerformIO internally but still presents a referentially transparent interface. The use of unsafePerformIO internally results in better performance than the previous two packages. (Off the shelf, its implementation uses comparison-based search data structures, rather than perhaps-slightly-more-efficient hash functions; but I think that if you find and replace Cmp for Hashable and Data.Map for Data.HashMap and add the appropraite imports, you get a hash based version.)

However, I'm not aware of any library that looks answers up based on object identity rather than object value. This can be important, because sometimes the kinds of object which are being used as keys to your memo table (that is, as input to the function being memoized) can be large---so large that fully examining the object to determine whether you've seen it before is itself a slow operation. Slow, and also unnecessary, if you will be applying the memoized function again and again to an object which is stored at a given 'location in memory' 1. (This might happen, for example, if we're memoizing a function which is being called recursively over some large data structure with a lot of structural sharing.) If we've already computed our memoized function on that exact object before, we can already know the answer, even without looking at the object itself!

Implementing such a memoization library involves several subtle issues and doing it properly requires several special pieces of support from the language. Luckily, GHC provides all the special features that we need, and there is a paper by Peyton-Jones, Marlow and Elliott which basically worries about most of these issues for you, explaining how to build a solid implementation. They don't provide all details, but they get close.

The one detail which I can see which one probably ought to worry about, but which they don't worry about, is thread safety---their code is apparently not threadsafe at all.

My question is: does anyone know of a packaged library which does the kind of memoization discussed in the Peyton-Jones, Marlow and Elliott paper, filling in all the details (and preferably filling in proper thread-safety as well)?

Failing that, I guess I will have to code it up myself: does anyone have any ideas of other subtleties (beyond thread safety and the ones discussed in the paper) which the implementer of such a library would do well to bear in mind?

UPDATE

Following @luqui's suggestion below, here's a little more data on the exact problem I face. Let's suppose there's a type:

data Node = Node [Node] [Annotation]

This type can be used to represent a simple kind of rooted DAG in memory, where Nodes are DAG Nodes, the root is just a distinguished Node, and each node is annotated with some Annotations whose internal structure, I think, need not concern us (but if it matters, just ask and I'll be more specific.) If used in this way, note that there may well be significant structural sharing between Nodes in memory---there may be exponentially more paths which lead from the root to a node than there are nodes themselves. I am given a data structure of this form, from an external library with which I must interface; I cannot change the data type.

I have a function

myTransform : Node -> Node

the details of which need not concern us (or at least I think so; but again I can be more specific if needed). It maps nodes to nodes, examining the annotations of the node it is given, and the annotations its immediate children, to come up with a new Node with the same children but possibly different annotations. I wish to write a function

recursiveTransform : Node -> Node

whose output 'looks the same' as the data structure as you would get by doing:

recursiveTransform Node originalChildren annotations = 
   myTransform Node recursivelyTransformedChildren annotations
   where
     recursivelyTransformedChildren = map recursiveTransform originalChildren

except that it uses structural sharing in the obvious way so that it doesn't return an exponential data structure, but rather one on the order of the same size as its input.

I appreciate that this would all be easier if say, the Nodes were numbered before I got them, or I could otherwise change the definition of a Node. I can't (easily) do either of these things.

I am also interested in the general question of the existence of a library implementing the functionality I mention quite independently of the particular concrete problem I face right now: I feel like I've had to work around this kind of issue on a few occasions, and it would be nice to slay the dragon once and for all. The fact that SPJ et al felt that it was worth adding not one but three features to GHC to support the existence of libraries of this form suggests that the feature is genuinely useful and can't be worked around in all cases. (BUT I'd still also be very interested in hearing about workarounds which will help in this particular case too: the long term problem is not as urgent as the problem I face right now :-) )

1 Technically, I don't quite mean location in memory, since the garbage collector sometimes moves objects around a bit---what I really mean is 'object identity'. But we can think of this as being roughly the same as our intuitive idea of location in memory.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

笔芯 2024-12-03 22:35:09

如果您只想基于对象标识而不是相等性进行记忆，则可以使用语言中内置的现有惰性机制。

例如，如果您有这样的数据结构

data Foo = Foo { ... }
expensive :: Foo -> Bar

，那么您只需添加要记忆的值作为额外字段，然后让惰性为您处理其余的事情。

data Foo = Foo { ..., memo :: Bar }

为了让它更容易使用，添加一个智能构造函数来喜结良缘。

makeFoo ... = let foo = Foo { ..., memo = expensive foo } in foo

尽管这比使用库稍微不太优雅，并且需要修改数据类型才能真正有用，但这是一种非常简单的技术，并且所有线程安全问题都已为您解决。

If you only want to memoize based on object identity, and not equality, you can just use the existing laziness mechanisms built into the language.

For example, if you have a data structure like this

data Foo = Foo { ... }
expensive :: Foo -> Bar

then you can just add the value to be memoized as an extra field and let the laziness take care of the rest for you.

data Foo = Foo { ..., memo :: Bar }

To make it easier to use, add a smart constructor to tie the knot.

makeFoo ... = let foo = Foo { ..., memo = expensive foo } in foo

Though this is somewhat less elegant than using a library, and requires modification of the data type to really be useful, it's a very simple technique and all thread-safety issues are already taken care of for you.

回复收藏 0 原文

风吹雪碎 2024-12-03 22:35:09

似乎 stable-memo 正是您所需要的（尽管我不确定如果它可以处理多个线程）：

大多数备忘录组合器基于相等性进行记忆，而 stable-memo 则根据之前是否已将完全相同的参数传递给函数（即内存中的相同参数）来进行记忆。
stable-memo 仅评估 WHNF 的密钥。
这更适合具有循环的图上的递归函数。
stable-memo 不会保留迄今为止所看到的键，这使得它们在不再使用时可以被垃圾回收。如果发生这种情况，终结器会从备忘录表中删除相应的条目。
Data.StableMemo.Weak 提供了一组替代组合器，它们也避免保留函数的结果，仅在结果尚未被垃圾收集时重用结果。
函数的参数没有类型类约束。
stable-memo 不适用于碰巧具有相同值但不是同一堆对象的参数。这排除了许多记忆化的候选者，例如最常见的例子，其域是机器整数的朴素斐波那契实现；不过，它仍然可以适用于某些领域，例如懒惰的自然人。

回复收藏 0 原文

作妖 2024-12-03 22:35:09

Ekmett 刚刚上传了一个处理此问题及更多内容的库（由 HacPhi 生成）： http://hackage.haskell.org /包/实习生。他向我保证它是线程安全的。

编辑：实际上，严格来说，我意识到这做了一些相当不同的事情。但我认为你可以将它用于你的目的。它实际上更像是一个 stringtable-atom 类型的实习库，可以处理任意数据结构（包括递归数据结构）。它在内部使用 WeakPtrs 来维护表。但是，它使用 Int 来索引值以避免结构相等性检查，这意味着将它们打包到数据类型中，而您显然想要的实际上是 StableName 。所以我意识到这回答了一个相关的问题，但需要修改您的数据类型，这是您想要避免的......

回复收藏 0 原文

~没有更多了~