是否有一个基于对象身份的、线程安全的记忆库?
我知道记忆似乎是堆栈溢出的 haskell 标签上的一个长期话题,但我认为以前没有人问过这个问题。
我知道 Haskell 有几个不同的“现成”记忆库:
- memo-combinators 和 memotrie 包,它们利用涉及惰性无限数据结构的漂亮技巧以纯函数方式实现记忆化。 (据我了解,前者稍微灵活一些,而后者在简单情况下更容易使用:请参阅 这个SO答案供讨论。)
- uglymemo包,它在内部使用unsafePerformIO,但仍然提供一个引用透明的接口。在内部使用 unsafePerformIO 会带来更好的性能< /a> 比前两个包。 (现成的,它的实现使用基于比较的搜索数据结构,而不是可能稍微高效的哈希函数;但我认为,如果您找到并替换
Cmp
为Hashable< /code> 和
Data.Map
forData.HashMap
并添加适当的import
,您将获得基于哈希的版本。)
但是,我我不知道有哪个图书馆会根据基于对象标识而不是对象值。这可能很重要,因为有时用作备忘录表键(即,作为正在记忆的函数的输入)的对象类型可能很大——大到需要全面检查该对象以确定您是否以前见过它本身就是一个缓慢的操作。如果您将一次又一次地将记忆函数应用到存储在给定“内存中的位置”的对象,那么速度很慢,而且也是不必要的 1。 (例如,如果我们正在记忆一个在具有大量结构共享的大型数据结构上递归调用的函数,则可能会发生这种情况。)如果我们之前已经在该确切对象上计算了记忆函数,我们可以即使不看物体本身就已经知道答案了!
实现这样的记忆库涉及几个微妙的问题,并且正确地实现它需要语言的几个特殊支持。幸运的是,GHC 提供了我们需要的所有特殊功能,并且有一个 Peyton-Jones、Marlow 和 Elliott 的论文基本上为您担心了大部分问题,解释了如何构建可靠的实现。他们没有提供所有细节,但已经很接近了。
我可以看到哪个人可能应该担心但他们不担心的一个细节是线程安全——他们的代码显然根本不是线程安全的。
我的问题是:有谁知道有一个打包库可以执行 Peyton-Jones、Marlow 和 Elliott 论文中讨论的那种记忆功能,填充所有细节(最好也填充适当的线程安全性)?
如果做不到这一点,我想我将不得不自己编写代码:是否有人对此类库的实现者会做的其他微妙之处(超出线程安全性和论文中讨论的内容)有任何想法好记住吗?
更新
按照下面@luqui的建议,这里有一些关于我所面临的确切问题的更多数据。假设有一个类型:
data Node = Node [Node] [Annotation]
这个类型可以用来表示内存中一种简单的有根DAG,其中Node
是DAG节点,根只是一个区分的Node
,并且每个节点都用一些 Annotation
进行注释,我认为其内部结构不需要我们关心(但如果重要,请询问,我会更具体。)如果以这种方式使用,请注意,很可能存在显着的内存中节点
之间的结构共享——从根到节点的路径可能比节点本身的数量多得多。我从必须与之交互的外部库中获得了这种形式的数据结构;我无法更改数据类型。
我有一个函数,
myTransform : Node -> Node
其细节不需要我们关心(或者至少我这么认为;但如果需要的话我可以更具体)。它将节点映射到节点,检查给定节点的注释及其直接子节点的注释,以得出具有相同子节点但可能不同注释的新 Node
。我希望编写一个函数,
recursiveTransform : Node -> Node
其输出与您将得到的数据结构“看起来相同”:
recursiveTransform Node originalChildren annotations =
myTransform Node recursivelyTransformedChildren annotations
where
recursivelyTransformedChildren = map recursiveTransform originalChildren
除了它以明显的方式使用结构共享,这样它就不会返回指数数据结构,而是返回一个与其输入大小相同的订单。
我很高兴,如果在获得节点之前对它们进行编号,或者我可以更改节点的定义,那么这一切都会更容易。我不能(轻易)做这两件事。
我也对是否存在一个实现我提到的功能的库的一般问题感兴趣,该库与我现在面临的特定具体问题完全无关:我觉得我必须在某些情况下解决此类问题,如果能一劳永逸地杀死巨龙就好了。 SPJ 等人认为值得向 GHC 添加三个功能来支持这种形式的库的存在,这一事实表明该功能确实有用,并且无法在所有 案例。 (但我仍然对听到在这种特殊情况下也有帮助的解决方法非常感兴趣:长期问题并不像我现在面临的问题那么紧迫:-))
1 从技术上讲,我并不是指位置内存,因为垃圾收集器有时会稍微移动对象——我真正的意思是“对象标识”。但我们可以认为这与我们对内存中位置的直观想法大致相同。
I know that memoization seems to be a perennial topic here on the haskell tag on stack overflow, but I think this question has not been asked before.
I'm aware of several different 'off the shelf' memoization libraries for Haskell:
- The memo-combinators and memotrie packages, which make use of a beautiful trick involving lazy infinite data structures to achieve memoization in a purely functional way. (As I understand it, the former is slightly more flexible, while the latter is easier to use in simple cases: see this SO answer for discussion.)
- The uglymemo package, which uses unsafePerformIO internally but still presents a referentially transparent interface. The use of unsafePerformIO internally results in better performance than the previous two packages. (Off the shelf, its implementation uses comparison-based search data structures, rather than perhaps-slightly-more-efficient hash functions; but I think that if you find and replace
Cmp
forHashable
andData.Map
forData.HashMap
and add the appropraiteimport
s, you get a hash based version.)
However, I'm not aware of any library that looks answers up based on object identity rather than object value. This can be important, because sometimes the kinds of object which are being used as keys to your memo table (that is, as input to the function being memoized) can be large---so large that fully examining the object to determine whether you've seen it before is itself a slow operation. Slow, and also unnecessary, if you will be applying the memoized function again and again to an object which is stored at a given 'location in memory' 1. (This might happen, for example, if we're memoizing a function which is being called recursively over some large data structure with a lot of structural sharing.) If we've already computed our memoized function on that exact object before, we can already know the answer, even without looking at the object itself!
Implementing such a memoization library involves several subtle issues and doing it properly requires several special pieces of support from the language. Luckily, GHC provides all the special features that we need, and there is a paper by Peyton-Jones, Marlow and Elliott which basically worries about most of these issues for you, explaining how to build a solid implementation. They don't provide all details, but they get close.
The one detail which I can see which one probably ought to worry about, but which they don't worry about, is thread safety---their code is apparently not threadsafe at all.
My question is: does anyone know of a packaged library which does the kind of memoization discussed in the Peyton-Jones, Marlow and Elliott paper, filling in all the details (and preferably filling in proper thread-safety as well)?
Failing that, I guess I will have to code it up myself: does anyone have any ideas of other subtleties (beyond thread safety and the ones discussed in the paper) which the implementer of such a library would do well to bear in mind?
UPDATE
Following @luqui's suggestion below, here's a little more data on the exact problem I face. Let's suppose there's a type:
data Node = Node [Node] [Annotation]
This type can be used to represent a simple kind of rooted DAG in memory, where Node
s are DAG Nodes, the root is just a distinguished Node
, and each node is annotated with some Annotation
s whose internal structure, I think, need not concern us (but if it matters, just ask and I'll be more specific.) If used in this way, note that there may well be significant structural sharing between Node
s in memory---there may be exponentially more paths which lead from the root to a node than there are nodes themselves. I am given a data structure of this form, from an external library with which I must interface; I cannot change the data type.
I have a function
myTransform : Node -> Node
the details of which need not concern us (or at least I think so; but again I can be more specific if needed). It maps nodes to nodes, examining the annotations of the node it is given, and the annotations its immediate children, to come up with a new Node
with the same children but possibly different annotations. I wish to write a function
recursiveTransform : Node -> Node
whose output 'looks the same' as the data structure as you would get by doing:
recursiveTransform Node originalChildren annotations =
myTransform Node recursivelyTransformedChildren annotations
where
recursivelyTransformedChildren = map recursiveTransform originalChildren
except that it uses structural sharing in the obvious way so that it doesn't return an exponential data structure, but rather one on the order of the same size as its input.
I appreciate that this would all be easier if say, the Nodes
were numbered before I got them, or I could otherwise change the definition of a Node
. I can't (easily) do either of these things.
I am also interested in the general question of the existence of a library implementing the functionality I mention quite independently of the particular concrete problem I face right now: I feel like I've had to work around this kind of issue on a few occasions, and it would be nice to slay the dragon once and for all. The fact that SPJ et al felt that it was worth adding not one but three features to GHC to support the existence of libraries of this form suggests that the feature is genuinely useful and can't be worked around in all cases. (BUT I'd still also be very interested in hearing about workarounds which will help in this particular case too: the long term problem is not as urgent as the problem I face right now :-) )
1 Technically, I don't quite mean location in memory, since the garbage collector sometimes moves objects around a bit---what I really mean is 'object identity'. But we can think of this as being roughly the same as our intuitive idea of location in memory.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
如果您只想基于对象标识而不是相等性进行记忆,则可以使用语言中内置的现有惰性机制。
例如,如果您有这样的数据结构
,那么您只需添加要记忆的值作为额外字段,然后让惰性为您处理其余的事情。
为了让它更容易使用,添加一个智能构造函数来喜结良缘。
尽管这比使用库稍微不太优雅,并且需要修改数据类型才能真正有用,但这是一种非常简单的技术,并且所有线程安全问题都已为您解决。
If you only want to memoize based on object identity, and not equality, you can just use the existing laziness mechanisms built into the language.
For example, if you have a data structure like this
then you can just add the value to be memoized as an extra field and let the laziness take care of the rest for you.
To make it easier to use, add a smart constructor to tie the knot.
Though this is somewhat less elegant than using a library, and requires modification of the data type to really be useful, it's a very simple technique and all thread-safety issues are already taken care of for you.
似乎 stable-memo 正是您所需要的(尽管我不确定如果它可以处理多个线程):
It seems that stable-memo would be just what you needed (although I'm not sure if it can handle multiple threads):
Ekmett 刚刚上传了一个处理此问题及更多内容的库(由 HacPhi 生成): http://hackage.haskell.org /包/实习生。他向我保证它是线程安全的。
编辑:实际上,严格来说,我意识到这做了一些相当不同的事情。但我认为你可以将它用于你的目的。它实际上更像是一个 stringtable-atom 类型的实习库,可以处理任意数据结构(包括递归数据结构)。它在内部使用 WeakPtrs 来维护表。但是,它使用
Int
来索引值以避免结构相等性检查,这意味着将它们打包到数据类型中,而您显然想要的实际上是StableName
。所以我意识到这回答了一个相关的问题,但需要修改您的数据类型,这是您想要避免的......Ekmett just uploaded a library that handles this and more (produced at HacPhi): http://hackage.haskell.org/package/intern. He assures me that it is thread safe.
Edit: Actually, strictly speaking I realize this does something rather different. But I think you can use it for your purposes. It's really more of a stringtable-atom type interning library that works over arbitrary data structures (including recursive ones). It uses WeakPtrs internally to maintain the table. However, it uses
Int
s to index the values to avoid structural equality checks, which means packing them into the data type, when what you want are apparently actuallyStableName
s. So I realize this answers a related question, but requires modifying your data type, which you want to avoid...