The worse sin: side effects or passing massive objects?

Posted on 2024-07-04 19:58:36

I have a function inside a loop inside a function. The inner function acquires a large vector of data and stores it in memory (as a global variable; I'm using R, which is like S-Plus). The loop iterates over a long list of datasets to be acquired. The outer function starts the process and passes in the list of datasets to be acquired.

for (dataset in list_of_datasets) {
  for (datachunk in dataset) {
    <process datachunk>
    <store result? as vector? where?>
  }
}

I programmed the inner function to store each dataset before moving to the next, so all the work of the outer function occurs as side effects on global variables... a big no-no. Is this better or worse than collecting and returning a giant, memory-hogging vector of vectors? Is there a superior third approach?

Would the answer change if I were storing the data vectors in a database rather than in memory? Ideally, I'd like to be able to terminate the function (or have it fail due to network timeouts) without losing all the information processed prior to termination.


Comments (7)

你的往事 2024-07-11 19:58:36

Third approach: the inner function returns a reference to the large array, which the next statement inside the loop then dereferences and stores wherever it's needed (ideally with a single pointer store rather than a memcpy of the entire array).

This gets rid of both the side effect and the passing of large datastructures.

醉态萌生 2024-07-11 19:58:36

It's tough to say definitively without knowing the language/compiler used. However, if you can simply pass a pointer/reference to the object that you're creating, then the size of the object itself has nothing to do with the speed of the function calls. Manipulating this data down the road could be a different story.

云巢 2024-07-11 19:58:36

I'm not sure I understand the question, but I have a couple of solutions.

  1. Inside the function, create a list of the vectors and return that.

  2. Inside the function, create an environment and store all the vectors inside of that. Just make sure that you return the environment in case of errors.

In R:

help(environment)

# You might do something like this:

outer <- function(datasets) {
  # create the return environment
  ret.env <- new.env()
  for (set in datasets) {
    tmp <- inner(set)
    # Check for errors however you like here. You might have inner return a list
    # that contains an error component.
    assign(set, tmp, envir = ret.env)
  }
  return(ret.env)
}

# The inner function might be defined like this:

inner <- function(dataset) {
  # I don't know what you are doing here, but let's pretend you are reading a data file
  # that is named by dataset
  filedata <- read.table(dataset, header = TRUE)
  return(filedata)
}

leif

女中豪杰 2024-07-11 19:58:36

Use variables in the outer function instead of global variables. This gets you the best of both approaches: you're not mutating global state, and you're not copying a big wad of data. If you have to exit early, just return the partial results.

(See the "Scope" section in the R manual: http://cran.r-project.org/doc/manuals/R-intro.html#Scope)
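A minimal sketch of that idea, assuming a hypothetical `fetch_dataset()` that stands in for the real acquisition step and fails on one input to simulate a timeout. Results accumulate in a list local to the outer function (here called `acquire_all()`, also a made-up name), and `tryCatch()` turns a failure into an early return of whatever has been collected so far.

```r
# Hypothetical stand-in for the real acquisition step; it fails on one
# input to simulate a network timeout.
fetch_dataset <- function(name) {
  if (name == "bad") stop("simulated network timeout")
  seq_len(nchar(name))
}

acquire_all <- function(dataset_names) {
  results <- list()  # local to this call, not a global
  for (name in dataset_names) {
    # Capture the error object instead of letting it abort the function.
    value <- tryCatch(fetch_dataset(name), error = function(e) e)
    if (inherits(value, "error")) {
      warning("stopping early at '", name, "': ", conditionMessage(value))
      break
    }
    results[[name]] <- value
  }
  results  # complete or partial, depending on how far the loop got
}

partial <- suppressWarnings(acquire_all(c("a", "bb", "bad", "cccc")))
names(partial)
```

A user interrupt is a separate condition class from an error, so surviving a manual stop would need an `interrupt` handler in `tryCatch()` as well.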

疯到世界奔溃 2024-07-11 19:58:36

Remember your Knuth: "Premature optimization is the root of all evil."

Try the side-effect-free version. See if it meets your performance goals. If it does, great, you don't have a problem in the first place; if it doesn't, then use the side effects, and leave a note for the next programmer that your hand was forced.

情丝乱 2024-07-11 19:58:36

It's not going to make much difference to memory use, so you might as well make the code clean.

Since R has copy-on-modify for variables, modifying the global object will have the same memory implications as passing something up in return values.

If you store the outputs in a database (or even in a file), you won't have the memory use issues, and the data will be incrementally available as it is created, rather than just at the end. Whether it's faster with the database depends primarily on how much memory you are using: will the reduction in garbage collection pay for the cost of writing to disk?
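One way to get that incremental availability without a database is to write each result to its own file as soon as it is computed. The sketch below assumes a hypothetical `process_dataset()` and an output directory chosen by the caller; `saveRDS()` and `readRDS()` are base R, everything else is made up for illustration.

```r
# Hypothetical processing step standing in for the real acquisition.
process_dataset <- function(name) rnorm(5)

acquire_to_disk <- function(dataset_names, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (name in dataset_names) {
    path <- file.path(out_dir, paste0(name, ".rds"))
    if (file.exists(path)) next  # already saved by an earlier run
    saveRDS(process_dataset(name), path)
  }
  invisible(out_dir)
}

out_dir <- file.path(tempdir(), "acquired")
acquire_to_disk(c("alpha", "beta"), out_dir)

# A rerun after an interruption skips the datasets that are already on disk.
acquire_to_disk(c("alpha", "beta", "gamma"), out_dir)
sort(list.files(out_dir))

# Any finished dataset can be loaded on its own.
alpha <- readRDS(file.path(out_dir, "alpha.rds"))
```

Because each dataset is saved before the loop moves on, killing the function loses at most the dataset in progress. A kill in the middle of `saveRDS()` could still leave a truncated file; writing to a temporary name and renaming afterwards is a common safeguard.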

There are both time and memory profilers in R, so you can see empirically what the impacts are.
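As a rough illustration of measuring rather than guessing, the sketch below times a side-effect-free version against one that writes to a global, using `system.time()`, `gc()`, and `object.size()`. The workload (`make_chunk()`, 50 chunks of 10,000 numbers) is an assumption chosen only to make the example run quickly; real numbers will depend on the actual data.

```r
n_sets <- 50
make_chunk <- function(i) rnorm(1e4)  # hypothetical stand-in for real data

# Side-effect-free: build the list and return it.
pure_version <- function() lapply(seq_len(n_sets), make_chunk)

# Side-effecting: write each chunk into a global list.
global_store <- list()
global_version <- function() {
  for (i in seq_len(n_sets)) global_store[[i]] <<- make_chunk(i)
  invisible(NULL)
}

invisible(gc(reset = TRUE))  # reset the "max used" counters
t_pure <- system.time(res <- pure_version())
mem_after_pure <- gc()       # memory statistics after the pure run

t_global <- system.time(global_version())

print(t_pure[["elapsed"]])
print(t_global[["elapsed"]])
print(mem_after_pure)
print(object.size(res), units = "Mb")
```

For longer runs, `Rprof()` and `summaryRprof()` from the utils package give a sampling profile of where the time goes.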

祁梦 2024-07-11 19:58:36

FYI, here's a full sample toy solution that avoids side effects:

outerfunc <- function(names) {
  templist <- list()
  for (aname in names) {
    templist[[aname]] <- innerfunc(aname)
  }
  templist
}

innerfunc <- function(aname) {
  retval <- NULL
  if ("one" %in% aname) retval <- c(1)
  if ("two" %in% aname) retval <- c(1,2)
  if ("three" %in% aname) retval <- c(1,2,3)
  retval
}

names <- c("one","two","three")

name_vals <- outerfunc(names)

for (name in names) assign(name, name_vals[[name]])