Options for caching / memoization / hashing in R
I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.
Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?
As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).
It looks like the options are:
Hashing
- digest - provides hashing for arbitrary R objects.
Memoization
- memoise - a very simple tool for memoization of functions.
- R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
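As a rough illustration of the simplest of these options, here is a minimal sketch of wrapping a function with memoise. It assumes the memoise package is installed; `slow_double` and the `calls` counter are hypothetical names introduced only for this example.

```r
# Sketch only: assumes the memoise package is installed.
library(memoise)

calls <- 0  # counts how often the underlying function really runs

# Hypothetical stand-in for an expensive computation.
slow_double <- function(x) {
  calls <<- calls + 1
  x * 2
}

# memoise() returns a wrapper that caches results per distinct argument set.
fast_double <- memoise(slow_double)

first  <- fast_double(21)  # computed
second <- fast_double(21)  # served from the cache, slow_double is not called again
```

After the two calls, `calls` is 1, which is a quick way to confirm that the second result came from the cache.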
Caching
- hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
Key/value storage
These are basic options for external storage of R objects.
Checkpointing
- cacher - this seems to be more akin to checkpointing.
- CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
- DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
Other
- Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
- The data.table package supports rapid lookups of elements in a data table.
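Of the base R options listed above, environments come closest to a Perl hash or Python dictionary. The following is a minimal sketch using only base R; the helper `get_or` is a hypothetical name introduced for this example, not a base function.

```r
# An environment used as a string-keyed hash table (base R only).
h <- new.env(hash = TRUE)

# Insert or update: keys are character strings, values can be any R object.
h[["apple"]] <- 3
assign("pear", 1:5, envir = h)

# Look up a key, returning a default when it is absent.
# inherits = FALSE keeps the lookup from falling through to parent environments.
get_or <- function(env, key, default = NULL) {
  if (exists(key, envir = env, inherits = FALSE)) env[[key]] else default
}

apple_value <- get_or(h, "apple")
kiwi_value  <- get_or(h, "kiwi", default = 0)

rm("pear", envir = h)  # delete a key
remaining_keys <- ls(h)  # list the keys that are left
```

Because environments have reference semantics, `h` can be passed to functions and modified in place, which is what makes it behave like a mutable dictionary rather than like a list.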
Use case
Although I'm mostly interested in knowing the options, I have two basic use cases that arise:
- Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
- Memoization of monstrous calculations.
These really arise because I'm digging into the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
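One way to explore both ideas at once is a hand-rolled memoizer that also records cache hits and misses, so the hit rate shows whether memoization would pay off. This is a sketch in base R under a loud assumption: the cache key is built with `deparse()`, which is adequate for simple arguments (numbers, short strings) but is not a robust hash for large or complex objects. The names `memoize_simple` and `slow_sum` are hypothetical.

```r
# Wrap f so its results are cached in an environment keyed on the arguments.
# ASSUMPTION: deparse() of the argument list is used as the key; this is only
# suitable for small, simple arguments.
memoize_simple <- function(f) {
  cache <- new.env(hash = TRUE)
  stats <- new.env()
  stats$hits <- 0
  stats$misses <- 0
  wrapper <- function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (exists(key, envir = cache, inherits = FALSE)) {
      stats$hits <- stats$hits + 1
      return(cache[[key]])
    }
    stats$misses <- stats$misses + 1
    value <- f(...)
    assign(key, value, envir = cache)
    value
  }
  attr(wrapper, "stats") <- stats  # expose the counters for inspection
  wrapper
}

# Hypothetical stand-in for a monstrous calculation.
slow_sum <- function(n) {
  Sys.sleep(0.05)
  sum(as.numeric(seq_len(n)))
}

fast_sum <- memoize_simple(slow_sum)
inputs   <- c(10, 20, 10, 10, 20, 30)
results  <- vapply(inputs, fast_sum, numeric(1))

hits   <- attr(fast_sum, "stats")$hits
misses <- attr(fast_sum, "stats")$misses
```

A high ratio of hits to misses on real inputs suggests memoization is worth adopting; a ratio near zero suggests the inputs rarely repeat and caching would only add overhead.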
Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.
Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)
- Dirk Eddelbuettel: digest - a lot of other packages depend on this.
- Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
- Christopher Brown: hash - seems to be a useful package, but the links to ODG are down, unfortunately.
- Henrik Bengtsson: R.cache & Hadley Wickham: memoise - it's not yet clear when to prefer one package over the other.
Note 3: Some people use memoise/memoisation; others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".
Comments (3)
I did not have luck with memoise because it gave a 'too deep recursive' problem with some functions of a package I tried it on. I had better luck with R.cache. Following is more annotated code I adapted from the R.cache documentation. The code shows different options for doing caching:

For simple counting of strings (without using table or similar), a multiset data structure seems like a good fit. An environment object can be used to emulate this.
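A minimal sketch of that multiset idea, using only base R: strings are counted one at a time as they arrive, so the full set never has to be in memory at once. The helper `add_string` is a hypothetical name, and the `textConnection` is a stand-in for a large file or stream.

```r
# Emulate a multiset (string -> count) with an environment.
counts <- new.env(hash = TRUE)

# Increment the count for one string; environments are modified in place.
add_string <- function(env, s) {
  current <- env[[s]]
  env[[s]] <- if (is.null(current)) 1L else current + 1L
  invisible(env)
}

# Stand-in for a large input: read one line at a time from a connection.
con <- textConnection(c("a", "b", "a", "c", "a", "b"))
while (length(line <- readLines(con, n = 1)) > 0) {
  # Empty strings cannot be used as environment keys, so skip them.
  if (nzchar(line)) add_string(counts, line)
}
close(con)

# Collect the counts into a named integer vector (keys come back sorted).
keys   <- ls(counts)
result <- vapply(keys, function(k) counts[[k]], integer(1))
```

The resulting vector resembles what `table()` would give, but it is built incrementally rather than from a fully loaded character vector.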
Related to @biocyperman's solution: R.cache has a wrapper function that takes care of loading, saving, and evaluating the cache. See the modified function:
R.cache provides a wrapper for loading, evaluating, and saving. You can simplify your code like this:
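The code from these comments is not reproduced here, so the following is only a sketch of the two patterns they describe, not the commenters' own code. It assumes the R.cache package is installed and uses `loadCache()`/`saveCache()` for the manual pattern and `memoizedCall()` as the wrapper; `slow_square` and `cached_square` are hypothetical names, and the cache is pointed at a temporary directory for the demo.

```r
# Sketch only: assumes the R.cache package is installed.
library(R.cache)

# Keep the demo cache out of the user's real cache directory.
cache_dir <- file.path(tempdir(), "Rcache-demo")
dir.create(cache_dir, showWarnings = FALSE)
setCacheRootPath(cache_dir)

# Hypothetical stand-in for an expensive function.
slow_square <- function(x) {
  Sys.sleep(0.2)
  x^2
}

# Manual pattern: build a key, try to load, otherwise compute and save.
cached_square <- function(x) {
  key <- list("slow_square", x)
  value <- loadCache(key = key)
  if (is.null(value)) {
    value <- slow_square(x)
    saveCache(value, key = key)
  }
  value
}

a <- cached_square(4)              # computed, then written to the cache
b <- cached_square(4)              # loaded from the cache

# Wrapper pattern: memoizedCall() performs the load/evaluate/save steps itself.
c1 <- memoizedCall(slow_square, 4) # computed on first use
c2 <- memoizedCall(slow_square, 4) # served from the cache
```

The wrapper form removes the key handling and the explicit `is.null()` check, at the cost of less control over what goes into the cache key.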