Circumventing R's "Error in if (nbins > .Machine$integer.max)"
This is a saga which began with the problem of how to do survey weighting. Now that I appear to be doing that correctly, I have hit a bit of a wall (see the previous post for details on the import process and where the strata variable came from):
> require(foreign)
> ipums <- read.dta('/path/to/data.dta')
> require(survey)
> ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Error in if (nbins > .Machine$integer.max) stop("attempt to make a table with >= 2^31 elements") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In pd * nl : NAs produced by integer overflow
> traceback()
9: tabulate(bin, pd)
8: as.vector(data)
7: array(tabulate(bin, pd), dims, dimnames = dn)
6: table(ids[, 1], strata[, 1])
5: inherits(x, "data.frame")
4: is.data.frame(x)
3: rowSums(table(ids[, 1], strata[, 1]) > 0)
2: svydesign.default(id = ~serial, weights = ~perwt, strata = ~strata,
data = ipums)
1: svydesign(id = ~serial, weights = ~perwt, strata = ~strata, data = ipums)
This error seems to come from the tabulate function, which I hoped would be straightforward enough to circumvent, first by changing .Machine$integer.max:
> .Machine$integer.max <- 2^40
and, when that didn't work, by replacing the whole source code of tabulate:
> tabulate <- function(bin, nbins = max(1L, bin, na.rm = TRUE))
  {
      if (!is.numeric(bin) && !is.factor(bin))
          stop("'bin' must be numeric or a factor")
      # if (nbins > .Machine$integer.max)
      if (nbins > 2^40)  # replacement line
          stop("attempt to make a table with >= 2^31 elements")
      .C("R_tabulate",
         as.integer(bin),
         as.integer(length(bin)),
         as.integer(nbins),
         ans = integer(nbins),
         NAOK = TRUE,
         PACKAGE = "base")$ans
  }
Neither circumvented the problem. Apparently this is one reason why the ff package was created, but what worries me is the extent to which this is a problem I cannot avoid in R. This post seems to indicate that even if I were to use a package that would avoid this problem, I would only be able to access 2^31 elements at a time. My hope was to use SQL (either SQLite or PostgreSQL) to get around the memory problems, but I'm afraid I'll spend a while getting that to work, only to run into the same fundamental limit.
Attempting to switch back to Stata doesn't solve the problem either. Again, see the previous post for how I use svyset, but the calculation I would like to run causes Stata to hang:
svy: mean age, over(strata)
Whether throwing more memory at it will solve the problem, I don't know. I run R on my desktop, which has 16 GB, and I use Stata through a Windows server, currently setting the memory allocation to 2000 MB, but I could theoretically experiment with increasing that.
So in sum:
- Is this a hard limit in R?
- Would SQL solve my R problems?
- If I split it up into many separate files, would that fix it (a lot of work...)?
- Would throwing a lot of memory at Stata do it?
- Am I seriously barking up the wrong tree somehow?
I am too unfamiliar with the details of what you are doing to offer any further advice on the other parts of your question.
Why do you want to work with the full data set? Wouldn't a smaller sample that fits within the restrictions R places on you be just as useful? You could use SQL to store all the data and query it from R to return a random subset of a more appropriate size.
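A minimal sketch of that approach, assuming the data have already been loaded into an SQLite table (the database file and table name here are illustrative, not from the question):

```r
# Illustrative only: pull a random subset of manageable size from SQLite.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "ipums.db")   # assumed database file
sub <- dbGetQuery(con,
    "SELECT * FROM ipums ORDER BY RANDOM() LIMIT 100000")
dbDisconnect(con)
```

ORDER BY RANDOM() is SQLite-specific; PostgreSQL spells it ORDER BY random().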
Since this question was asked some time ago, I'd like to point out that my answer here uses version 3.3 of the survey package.
If you check the code of svydesign, you can see that the function that causes all the trouble sits inside a check step that determines whether you should set the nest parameter to TRUE or not. This step can be disabled by setting the option check.strata=FALSE.
Of course, you shouldn't disable a check step unless you know what you are doing. In this case, you should be able to decide yourself whether you need to set the nest option to TRUE or FALSE. nest should be set to TRUE when the same PSU (cluster) id is recycled in different strata.
Concretely, for the IPUMS dataset, since you are using the serial variable for cluster identification and serial is unique for each household in a given sample, you may want to set nest to FALSE.
So, your survey design line would be:
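A sketch of that design line, taking the call from the question and adding the two options discussed above:

```r
# The question's svydesign() call, with the failing check disabled and
# nest set explicitly (serial is unique within a sample, so nest = FALSE).
ipums.design <- svydesign(id = ~serial, strata = ~strata, data = ipums,
                          weights = ~perwt, check.strata = FALSE,
                          nest = FALSE)
```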
Extra advice: even after circumventing this problem, you will find that the code is pretty slow unless you remap strata to the range 1 to length(unique(ipums$strata)):
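One way to do that remap in base R, keeping the column name from the question:

```r
# Replace each stratum label with its first-appearance index, giving
# integer strata that run from 1 to length(unique(ipums$strata)).
ipums$strata <- match(ipums$strata, unique(ipums$strata))
```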
Both @Gavin and @Martin deserve credit for this answer, or at least for leading me in the right direction. I'm mostly answering it separately to make it easier to read.
In the order I asked:
1. Yes, 2^31 is a hard limit in R, though it seems to matter what type it is (which is a bit strange, given that it is the length of the vector, rather than the amount of memory (of which I have plenty), that is the stated problem). Do not convert strata or id variables to factors; that will just fix their length and nullify the effects of subsetting (which is the way to get around this problem).
2. SQL could probably help, provided I learn how to use it correctly. I did the following test:
And found the means to be the same, which seems like a reasonable test. So, in theory, provided I can split up the calculation using either plyr or SQL, the results should still be fine.
3. See 2.
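The test itself wasn't shown above, but the idea behind it, that a weighted mean computed in chunks recombines to the full-data weighted mean, can be checked with a small simulation (synthetic data; the variable names echo the question but are illustrative):

```r
# A weighted mean computed chunk-by-chunk must match the one computed
# over the full vector: sum(w*x)/sum(w) decomposes over any partition.
set.seed(1)
age   <- sample(18:90, 1e5, replace = TRUE)
perwt <- runif(1e5, 1, 100)

full <- sum(age * perwt) / sum(perwt)

chunks <- split(seq_along(age), rep(1:10, length.out = length(age)))
num <- sum(sapply(chunks, function(i) sum(age[i] * perwt[i])))
den <- sum(sapply(chunks, function(i) sum(perwt[i])))

stopifnot(isTRUE(all.equal(full, num / den)))
```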
4. Throwing a lot of memory at Stata definitely helps, but now I'm running into annoying formatting issues. I seem to be able to perform most of the calculations I want (much quicker and with more stability as well), but I can't figure out how to get the output into the form I want. I will probably ask a separate question on this. I think the short version here is that, for big survey data, Stata is much better out of the box.
5. In many ways, yes. Trying to do analysis with data this big is not something I should have taken on lightly, and I'm far from figuring it out even now. I was using the svydesign function correctly, but I didn't really know what was going on. I have a (very slightly) better grasp now, and it's heartening to know I was generally correct about how to solve the problem. @Gavin's general suggestion of trying out small data with external results to compare against is invaluable, something I should have started ages ago. Many thanks to both @Gavin and @Martin.