How to handle an excessively large vector in lm() with a many-level factor as a control
I'm trying to fit a linear model with roughly 900,000 observations and just two explanatory variables. Yet, I additionally need to include a control variable that is a many-level factor variable (11,135 levels). The code for the regression looks like this:
model1 <- lm(dep_var ~ expl_var_1 + expl_var_2 + factor(control_var), data = data)
However, R throws the error "Cannot allocate a vector of size 75.6 GB".
I'm well aware that this is due to the many-level factor variable, but I need to include this variable as a control. Please note: this is not an ordered factor; it is simply an ID without any order.
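A rough back-of-the-envelope check of where an allocation of this size comes from, assuming lm() builds a dense double-precision model matrix with one dummy column per non-reference factor level. The row count of exactly 900,000 is an approximation taken from "roughly 900,000", so the result lands near the reported figure rather than exactly on it:

```r
# Approximate size of the dense model matrix lm() would need.
n_obs    <- 900000   # assumed: the post only says "roughly 900,000"
n_levels <- 11135    # levels of the control factor

# Columns: intercept + 2 explanatory variables + (levels - 1) dummies
n_cols <- 1 + 2 + (n_levels - 1)

bytes <- n_obs * n_cols * 8   # 8 bytes per double
gib   <- bytes / 1024^3       # assumed unit: binary gigabytes

n_cols
round(gib, 1)
```

Almost all of those cells are zeros, which is why approaches that never materialise the dummy columns avoid the problem.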
I've tried to find a solution to this problem, but ran into problems:
- I looked into plm, but that doesn't work: while my control variable can be interpreted as an ID, time doesn't play a role (and even if it did, there can be more than one observation per ID per time)
- I looked into biglm, but that seems better suited to data with many rows than to a many-level factor
My questions:
- Is there a way to include a variable in the regression but leave it out when assigning the outcome of the regression to model1? I'm really not interested at all in the coefficients per control variable factor level. I just need to control for it.
- If there isn't: can I efficiently split up my regression even if I cannot make sure that in each chunk there are all control variable factor levels present (that isn't feasible, because some levels just have 1 observation)?
I'd appreciate any starting points for a solution and ideas where to look for a solution - currently I'm just stuck with my level of knowledge and understanding.
Thanks in advance for your time, support, and patience.
Comments (2)
I am late to the party, but I don't actually see why biglm would not work. You would not need to have all the controls as dummies, but as one factor, thus making the problem much less sparse. The only thing is to create chunks of the data ahead of the biglm call (which you can do with split, or sample and split), run biglm on the first chunk, and then process the other chunks using the update() method that biglm provides. The number of chunks will depend on your memory.
Make sure you define the factor levels in each chunk in exactly the same way (set the levels, with or without relevel, before chunking). For levels absent from a chunk, biglm will return an NA, which will be updated in the later stages.
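The chunking procedure described above could look roughly like the following sketch. It runs on small simulated data; the variable names (d, id, x1, x2, y) and the choice of four chunks are made up for illustration and are not from the original question. The factor is created before splitting so every chunk carries the same set of levels.

```r
# Sketch: chunked fit with biglm on simulated data (requires the biglm package).
library(biglm)

set.seed(1)
n <- 2000
d <- data.frame(
  id = factor(sample(1:50, n, replace = TRUE)),  # levels fixed before chunking
  x1 = rnorm(n),
  x2 = rnorm(n)
)
d$y <- 1 + 2 * d$x1 - d$x2 + as.integer(d$id) / 10 + rnorm(n)

# Split rows into 4 chunks; split() keeps the full level set in each chunk
chunks <- split(d, rep(1:4, length.out = n))

# Fit on the first chunk, then fold in the remaining chunks one at a time
fit_big <- biglm(y ~ x1 + x2 + id, data = chunks[[1]])
for (ch in chunks[-1]) {
  fit_big <- update(fit_big, ch)
}

# For comparison on this small example: the ordinary full-data fit
fit_lm <- lm(y ~ x1 + x2 + id, data = d)

coef(fit_big)[c("x1", "x2")]
coef(fit_lm)[c("x1", "x2")]
```

As far as I understand it, biglm still estimates one coefficient per factor level and keeps a triangular factor whose size grows with the square of the number of coefficients, so with about 11,000 levels the memory saving comes from never holding all rows of the model matrix at once, not from dropping the dummies.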
The lfe library (Linear Fixed Effects) provides the felm() function. It works similarly to what Stefano Barbi suggested, but treats the factor variable as fixed effects rather than random effects. This is much closer to what I initially wanted. Also, felm() objects are compatible with sandwich, so I can cluster my standard errors, though not on the absorbed factor variable (which in my case is not a problem).
The felm() function is incredibly fast compared to lm(), at least in my specific case.
lfe comes with a number of vignettes for different specific problems. From the help file: "The package uses the Method of Alternating Projections to estimate linear models with multiple group fixed effects. A generalization of the within estimator. It supports IV-estimation with multiple endogenous variables via 2SLS, with conditional F statistics for detection of weak instruments. It is thread-parallelized and intended for large problems. A method for correcting limited mobility bias is also included."
The only drawback I have found so far: felm() objects are not compatible with margins, so you have to compute margins by hand.
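To illustrate what "absorbing" the factor means, here is a minimal sketch on simulated data (all variable names are invented for the example). It compares three fits: the dummy-variable lm(), a hand-rolled within transformation that subtracts group means, and felm() with the factor placed after the | in the formula. With a single absorbed factor, all three should give the same slopes for the explanatory variables.

```r
# Sketch: absorbing a many-level factor (requires the lfe package).
library(lfe)

set.seed(42)
n  <- 5000
id <- factor(sample(1:200, n, replace = TRUE))
x1 <- rnorm(n)
x2 <- rnorm(n)
id_effect <- rnorm(nlevels(id))[as.integer(id)]   # one intercept shift per id
y  <- 1.5 * x1 - 0.5 * x2 + id_effect + rnorm(n)
d  <- data.frame(y, x1, x2, id)

# 1) Reference: explicit dummies (feasible here only because the data are small)
fit_lm <- lm(y ~ x1 + x2 + factor(id), data = d)

# 2) Within transformation by hand: subtract the per-id mean from each variable
demean <- function(v, g) v - ave(v, g)
d$y_w  <- demean(d$y,  d$id)
d$x1_w <- demean(d$x1, d$id)
d$x2_w <- demean(d$x2, d$id)
# Note: standard errors from this fit ignore the absorbed degrees of freedom
fit_within <- lm(y_w ~ x1_w + x2_w - 1, data = d)

# 3) felm(): the factor after "|" is projected out, not estimated as dummies
fit_felm <- felm(y ~ x1 + x2 | id, data = d)

slopes <- cbind(
  lm     = as.numeric(coef(fit_lm)[c("x1", "x2")]),
  within = as.numeric(coef(fit_within)),
  felm   = as.numeric(coef(fit_felm))
)
rownames(slopes) <- c("x1", "x2")
slopes
```

The hand-rolled version is only there to show the idea; felm() also handles the degrees-of-freedom correction and more than one absorbed factor.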