Getting around R's "Error in if (nbins > .Machine$integer.max)"

This is a saga which began with the problem of how to do survey weighting. Now that I appear to be doing that correctly, I have hit a bit of a wall (see previous post for details on the import process and where the strata variable came from):

> require(foreign)
> ipums <- read.dta('/path/to/data.dta')
> require(survey)
> ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Error in if (nbins > .Machine$integer.max) stop("attempt to make a table with >= 2^31 elements") : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In pd * nl : NAs produced by integer overflow
> traceback()
9: tabulate(bin, pd)
8: as.vector(data)
7: array(tabulate(bin, pd), dims, dimnames = dn)
6: table(ids[, 1], strata[, 1])
5: inherits(x, "data.frame")
4: is.data.frame(x)
3: rowSums(table(ids[, 1], strata[, 1]) > 0)
2: svydesign.default(id = ~serial, weights = ~perwt, strata = ~strata, 
       data = ipums)
1: svydesign(id = ~serial, weights = ~perwt, strata = ~strata, data = ipums)

This error seems to come from the tabulate function, which I hoped would be straightforward enough to circumvent, first by changing .Machine$integer.max

> .Machine$integer.max <- 2^40

and, when that didn't work, by replacing the whole source of the tabulate function with a patched copy:

> tabulate <- function(bin, nbins = max(1L, bin, na.rm=TRUE))
{
    if(!is.numeric(bin) && !is.factor(bin))
        stop("'bin' must be numeric or a factor")
    #if (nbins > .Machine$integer.max)
    if (nbins > 2^40) #replacement line
        stop("attempt to make a table with >= 2^31 elements")
    .C("R_tabulate",
       as.integer(bin),
       as.integer(length(bin)),
       as.integer(nbins),    # still coerced to 32-bit, so anything >= 2^31 becomes NA
       ans = integer(nbins), # and allocating the result vector hits the same limit
       NAOK = TRUE,
       PACKAGE="base")$ans
}

Neither circumvented the problem. Apparently this is one reason why the ff package was created, but what worries me is the extent to which this is a problem I cannot avoid in R. This post seems to indicate that even if I were to use a package that would avoid this problem, I would only be able to access 2^31 elements at a time. My hope was to use sql (either sqlite or postgresql) to get around the memory problems, but I'm afraid I'll spend a while getting that to work, only to run into the same fundamental limit.
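
As it turns out, neither hack could have worked: assigning to .Machine$integer.max just makes a shadowing copy in the global environment, and table() calls tabulate() from the base namespace, so my redefinition is never reached (and even if it were, as.integer() and integer() would still hit the 32-bit limit, as the comments above note). A quick sketch of both effects, assuming a fresh session:

.Machine$integer.max <- 2^40   # makes a modified copy in .GlobalEnv ...
base::.Machine$integer.max     # ... but base code still sees 2147483647

tabulate <- function(...) stop("never reached")
environment(table)             # <environment: namespace:base>
table(c(1, 1, 2))              # still works: base::tabulate is called, not ours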

Attempting to switch back to Stata doesn't solve the problem either. Again see the previous post for how I use svyset, but the calculation I would like to run causes Stata to hang:

svy: mean age, over(strata)

Whether throwing more memory at it will solve the problem I don't know. I run R on my desktop which has 16 gigs, and I use Stata through a Windows server, currently setting memory allocation to 2000MB, but I could theoretically experiment with increasing that.

So in sum:

  1. Is this a hard limit in R?
  2. Would sql solve my R problems?
  3. If I split it up into many separate files would that fix it (a lot of work...)?
  4. Would throwing a lot of memory at Stata do it?
  5. Am I seriously barking up the wrong tree somehow?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

花期渐远 2024-11-04 23:29:18

  1. Yes, R uses 32-bit indexes for vectors, so they can contain no more than 2^31 - 1 entries, and you are trying to create something with 2^40 elements. There is talk of introducing 64-bit indexes, but that is some way off from appearing in R. Vectors have the stated hard limit, and that is it as far as base R is concerned.

I am not familiar enough with the details of what you are doing to offer any further advice on the other parts of your question.

Why do you want to work with the full data set? Wouldn't a smaller sample that fits into the restrictions R places on you be just as useful? You could use SQL to store all the data and query it from R to return a random subset of more appropriate size.
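
As a sketch of that workflow (the file and table names here are hypothetical, and I'm assuming the data has already been loaded into SQLite):

library(DBI)       # generic database interface
library(RSQLite)   # SQLite driver

con <- dbConnect(SQLite(), "ipums.sqlite")
## ORDER BY RANDOM() LIMIT n returns a uniform random subset of n rows,
## so every vector that reaches R stays far below the 2^31 - 1 limit.
samp <- dbGetQuery(con, "SELECT * FROM ipums ORDER BY RANDOM() LIMIT 100000")
dbDisconnect(con)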

破晓 2024-11-04 23:29:18

Since this question was asked some time ago, I'd like to point out that my answer here uses version 3.3 of the survey package.

If you check the code of svydesign, you can see that the function that causes all the problems sits inside a check step that looks at whether you should set the nest parameter to TRUE or not. This step can be disabled by setting the option check.strata=FALSE.

Of course, you shouldn't disable a check step unless you know what you are doing. In this case, you should be able to decide for yourself whether you need to set the nest option to TRUE or FALSE. nest should be set to TRUE when the same PSU (cluster) id is recycled in different strata.
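
As a toy illustration with made-up ids, this "recycling" is exactly the pattern the disabled check (the rowSums(table(ids, strata) > 0) line in the traceback) is looking for:

psu     <- c(1, 1, 2, 1, 3)
stratum <- c("a", "a", "a", "b", "b")
rowSums(table(psu, stratum) > 0)   # psu 1 occurs in 2 strata => recycled,
                                   # so nest=TRUE would be required here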

Concretely for the IPUMS dataset, since you are using the serial variable for cluster identification and serial is unique for each household in a given sample, you may want to set nest to FALSE.

So, your survey design line would be:

ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt, check.strata=FALSE, nest=FALSE)

Extra advice: even after circumventing this problem you will find that the code is pretty slow unless you remap strata to a range from 1 to length(unique(ipums$strata)):

ipums$strata <- match(ipums$strata,unique(ipums$strata))
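
To see what that remap does, here is a quick made-up example:

x <- c(110, 450, 110, 780, 450)   # hypothetical raw strata codes
match(x, unique(x))               # gives 1 2 1 3 2: compact codes in
                                  # order of first appearance
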
你的心境我的脸 2024-11-04 23:29:18

Both @Gavin and @Martin deserve credit for this answer, or at least for leading me in the right direction. I'm mostly answering separately to make it easier to read.

In the order I asked:

  1. Yes, 2^31 is a hard limit in R, though it seems to matter what type it is (which is a bit strange given that the stated problem is the length of the vector, rather than the amount of memory, of which I have plenty). Do not convert strata or id variables to factors: that will just fix their set of levels in place and nullify the effects of subsetting (which is the way to get around this problem).

  2. sql could probably help, provided I learn how to use it correctly. I did the following test:

    library(multicore) # make svy fast!
    ri.ny <- subset(ipums, statefips_num %in% c(36, 44))
    ri.ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri.ny)
    svyby(~incwage, ~strata, ri.ny.design, svymean, data=ri.ny, na.rm=TRUE, multicore=TRUE)
    
    ri <- subset(ri.ny, statefips_num==44)
    ri.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri)
    ri.mean <- svymean(~incwage, ri.design, data=ri, na.rm=TRUE)
    
    ny <- subset(ri.ny, statefips_num==36)
    ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ny)
    ny.mean <- svymean(~incwage, ny.design, data=ny, na.rm=TRUE, multicore=TRUE)
    

    And found the means to be the same, which seems like a reasonable test.

    So: in theory, provided I can split up the calculation by either using plyr or sql, the results should still be fine (a sketch of the split-and-recombine idea follows this list).

  3. See 2.

  4. Throwing a lot of memory at Stata definitely helps, but now I'm running into annoying formatting issues. I seem to be able to perform most of the calculation I want (much quicker and with more stability as well) but I can't figure out how to get it into the form I want. Will probably ask a separate question on this. I think the short version here is that for big survey data, Stata is much better out of the box.

  5. In many ways, yes. Trying to do analysis with data this big is not something I should have taken on lightly, and I'm far from figuring it out even now. I was using the svydesign function correctly, but I didn't really know what was going on. I have a (very slightly) better grasp now, and it's heartening to know I was generally correct about how to solve the problem. @Gavin's general suggestion of trying out small data with external results to compare to is invaluable, something I should have started ages ago. Many thanks to both @Gavin and @Martin.
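
A minimal sketch of the split-and-recombine idea from point 2, using plain base R split()/lapply() (the plyr or sql versions would have the same shape; statefips_num is the state identifier used in the test above):

## Estimate mean incwage state by state, so no single design object ever
## holds the whole data set and every vector stays far below 2^31 - 1.
state_means <- lapply(split(ipums, ipums$statefips_num), function(d) {
  des <- svydesign(id = ~serial, strata = ~strata, weights = ~perwt, data = d)
  svymean(~incwage, des, na.rm = TRUE)
})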
