Options for caching / memoization / hashing in R

Posted on 2024-12-02 11:57:01

I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.

Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?

As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).

It looks like the options are:

Hashing

digest - provides hashing for arbitrary R objects.
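
As a rough illustration of what hashing an arbitrary R object looks like, here is a minimal sketch that assumes the digest package is installed. The objects and the variable names (key_x, key_y, key_z) are made up for the example; only digest() and its algo argument come from the package.

```r
library(digest)

# Two structurally identical objects hash to the same value ...
x <- list(mean = 2.3, sd = 3.0)
y <- list(mean = 2.3, sd = 3.0)
key_x <- digest(x, algo = "md5")
key_y <- digest(y, algo = "md5")

# ... while an object with different contents gives a different hash.
key_z <- digest(list(mean = 2.3, sd = 3.5), algo = "md5")

print(identical(key_x, key_y))  # same contents, same key
print(identical(key_x, key_z))  # different contents, different key
print(nchar(key_x))             # an md5 digest is 32 hex characters
```

A string like key_x can then serve as a lookup key in whatever cache structure is used.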

Memoization

memoise - a very simple tool for memoization of functions.
R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
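
For comparison with the R.cache approach, this is a small sketch of function memoization with memoise, assuming the package is installed. The function slow_square and the counter calls are hypothetical and exist only to show when the underlying function actually runs.

```r
library(memoise)

calls <- 0L  # counts how often the real function body runs

slow_square <- function(x) {
  calls <<- calls + 1L
  Sys.sleep(0.1)  # emulate a slow computation
  x^2
}

fast_square <- memoise(slow_square)

fast_square(4)  # computed: the body runs
fast_square(4)  # same argument: served from the cache
print(calls)

forget(fast_square)  # drop everything cached for this function
fast_square(4)       # computed again
print(calls)
```

The cache here lives in memory for the session by default, whereas the R.cache functions discussed in this thread store results on disk.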

Caching

hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
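
To give a feel for the Perl-style usage, here is a short sketch with the hash package, assuming it is installed. The word list and the name counts are invented for the example.

```r
library(hash)

counts <- hash()  # empty hash, comparable to %counts in Perl

for (w in c("apple", "pear", "apple", "fig", "apple")) {
  if (has.key(w, counts)) {
    counts[[w]] <- counts[[w]] + 1L
  } else {
    counts[[w]] <- 1L
  }
}

print(counts[["apple"]])    # count for a single key
print(sort(keys(counts)))   # all keys seen so far
```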

Key/value storage

These are basic options for external storage of R objects.

Checkpointing

cacher - this seems to be more akin to checkpointing.
CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.

Other

Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
The data.table package supports rapid lookups of elements in a data table.
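
The base R options can be sketched as follows, using only base functions. The names ages, dict and add_entry are hypothetical. The main practical difference is that an environment is modified in place, while a named vector or list is copied on modification.

```r
# Named vector: simple lookup by name.
ages <- c(alice = 31L, bob = 27L)
print(ages[["bob"]])

# Environment: a mutable dictionary with hashed lookup.
dict <- new.env(hash = TRUE, parent = emptyenv())
dict[["alice"]] <- 31L
assign("bob", 27L, envir = dict)

# Membership test and lookup with a default for missing keys.
print(exists("carol", envir = dict, inherits = FALSE))
print(get0("carol", envir = dict, inherits = FALSE, ifnotfound = NA_integer_))

# Environments have reference semantics: the caller sees the change.
add_entry <- function(d, key, value) {
  d[[key]] <- value
  invisible(d)
}
add_entry(dict, "dave", 40L)

rm("alice", envir = dict)  # delete a key
print(ls(dict))            # remaining keys
```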

Use case

Although I'm mostly interested in knowing the options, I have two basic use cases that arise:

Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
Memoization of monstrous calculations.

These really arise because I'm digging into the profiling of some slooooow code, and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
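
One way to check whether memoization would pay off is a hand-written memoizer that also counts cache hits and misses. The sketch below uses only base R. make_memo, slow_sum and the key scheme (deparsing the argument list into a string) are assumptions made for illustration, not part of any package mentioned here; for large or complex arguments a digest-style hash would be a more robust key.

```r
make_memo <- function(f) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  stats <- c(hits = 0L, misses = 0L)

  memo <- function(...) {
    # Hypothetical key scheme: the deparsed argument list as one string.
    key <- paste(deparse(list(...)), collapse = "")
    if (exists(key, envir = cache, inherits = FALSE)) {
      stats[["hits"]] <<- stats[["hits"]] + 1L
      return(cache[[key]])
    }
    stats[["misses"]] <<- stats[["misses"]] + 1L
    value <- f(...)
    assign(key, value, envir = cache)
    value
  }

  list(call = memo, stats = function() stats)
}

slow_sum <- function(a, b) {
  Sys.sleep(0.05)  # emulate a slow computation
  a + b
}

memo_sum <- make_memo(slow_sum)
inputs <- list(c(1, 2), c(3, 4), c(1, 2), c(1, 2))
results <- vapply(inputs, function(p) memo_sum$call(p[1], p[2]), numeric(1))

print(results)
print(memo_sum$stats())  # a high hit count suggests memoization is worthwhile
```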

Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.

Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)

Dirk Eddelbuettel: digest - a lot of other packages depend on this.
Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.

Note 3: Some people use memoise/memoisation; others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".


如歌彻婉言 2024-12-09 11:57:01

I did not have luck with memoise: it produced a 'too deep recursive' problem with some functions of a package I tried it on. I had better luck with R.cache. Below is annotated code that I adapted from the R.cache documentation. It shows different options for caching:

# Workaround to avoid the interactive question when loading the R.cache library
dir.create(path="~/.Rcache", showWarnings=FALSE)
library("R.cache")
setCacheRootPath(path="./.Rcache") # Create .Rcache in the current working dir
# In case we need the cache path (not used in this example).
cache.root <- getCacheRootPath()
simulate <- function(mean, sd) {
    # 1. Try to load cached data, if already generated
    key <- list(mean, sd)
    data <- loadCache(key)
    if (!is.null(data)) {
        cat("Loaded cached data\n")
        return(data);
    }
    # 2. If not available, generate it.
    cat("Generating data from scratch...")
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("ok\n")
    saveCache(data, key=key, comment="simulate()")
    data;
}
data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a = 2.3
b = 3.0
data <- simulate(a, b) # Will load cached data, params are checked by value
# Clean up
file.remove(findCache(key=list(2.3,3.0)))
file.remove(findCache(key=list(2.3,3.5)))

simulate2 <- function(mean, sd) {
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("Done generating data from scratch\n")
    data;
}
# Easy step to memoize a function.
# It is also possible to reassign the function name (see below).
# This would work with any functions from external packages.
mzs <- addMemoization(simulate2)

data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0) # Will load cached data
# Also possible to reassign the function name,
# but different memoizations of the same
# function will return the same cached result
# if the input params are the same
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)

# If the expression being evaluated depends on
# "input" objects, then these must be specified
# explicitly as "key" objects.
for (ii in 1:2) {
    for (kk in 1:3) {
        cat(sprintf("Iteration #%d:\n", kk))
        res <- evalWithMemoization({
            cat("Evaluating expression...")
            a <- kk
            Sys.sleep(1)
            cat("done\n")
            a
        }, key=list(kk=kk))
        # On the repeated run (ii == 2) the expression is not re-evaluated;
        # the result is loaded from the cache instead
        print(res)
        # Sanity checks
        stopifnot(a == kk)
        # Clean up
        rm(a)
    } # for (kk ...)
} # for (ii ...)


风筝在阴天搁浅。 2024-12-09 11:57:01

For simple counting of strings (and not using table or similar), a multiset data structure seems like a good fit. The environment object can be used to emulate this.

# Define the insert function for a multiset
msetInsert <- function(mset, s) {
    if (exists(s, mset, inherits=FALSE)) {
        mset[[s]] <- mset[[s]] + 1L
    } else {
        mset[[s]] <- 1L 
    }
}

# First we generate a bunch of strings
n <- 1e5L  # Total number of strings
nus <- 1e3L  # Number of unique strings
ustrs <- paste("Str", seq_len(nus))

set.seed(42)
strs <- sample(ustrs, n, replace=TRUE)


# Now we use an environment as our multiset    
mset <- new.env(TRUE, emptyenv()) # Ensure hashing is enabled

# ...and insert the strings one by one...
for (s in strs) {
    msetInsert(mset, s)
}

# Now we should have nus unique strings in the multiset    
identical(nus, length(mset))

# And the names should be correct
identical(sort(ustrs), sort(names(as.list(mset))))

# ...And an example of getting the count for a specific string
mset[["Str 3"]] # "Str 3" instance count (97 in the original run; the exact value depends on the sample() algorithm of your R version)
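
Since the question asks for counting without first loading all strings into memory, the same environment-based multiset can be fed from a connection in chunks. This is a base R sketch; count_lines and chunk_size are hypothetical names, and a textConnection stands in for a real file connection.

```r
count_lines <- function(con, chunk_size = 1000L) {
  counts <- new.env(hash = TRUE, parent = emptyenv())
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of input
    # Empty strings cannot be used as environment keys, so skip them.
    for (s in chunk[nzchar(chunk)]) {
      prev <- get0(s, envir = counts, inherits = FALSE, ifnotfound = 0L)
      assign(s, prev + 1L, envir = counts)
    }
  }
  counts
}

# A text connection stands in for a large file here.
con <- textConnection(c("a", "b", "a", "c", "a", "b"))
counts <- count_lines(con, chunk_size = 2L)
close(con)

# Collect the counts into a named integer vector.
print(unlist(mget(sort(ls(counts)), envir = counts)))
```

Only one chunk of lines is held in memory at a time, plus one counter per unique string.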


痕至 2024-12-09 11:57:01

Related to @biocyperman's solution: R.cache provides a wrapper function that handles loading, evaluating, and saving the cache, so you don't have to write those steps yourself. You can simplify your code like this:

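simulate <- function(mean, sd) {
    key <- list(mean, sd)
    # evalWithMemoization() loads the cached result for this key if present;
    # otherwise it evaluates the expression and saves the result.
    data <- evalWithMemoization(key = key, expr = {
        cat("Generating data from scratch...")
        data <- rnorm(1000, mean=mean, sd=sd)
        Sys.sleep(1) # Emulate slow algorithm
        cat("ok\n")
        data
    })
    data # Return the result visibly
}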


About the author

×纯※雪
