Regression with many nested categorical covariates
I have a few hundred thousand measurements where the dependent
variable is a probability, and would like to use logistic regression.
However, the covariates I have are all categorical, and worse, are all
nested. By this I mean that if a certain measurement has "city -
Phoenix" then obviously it is certain to have "state - Arizona" and
"country - U.S." I have four such factors - the most granular has
some 20k levels, but if need be I could do without that one, I think.
I also have a few non-nested categorical covariates (only four or so,
with maybe three different levels each).
What I am most interested in
is prediction - given a new observation in some city, I would like to
know the relevant probability/dependent variable. I am not interested
as much in the related inferential machinery - standard deviations,
etc - at least as of now. I am hoping I can afford to be sloppy.
However, I would love to have that information unless it requires
methods that are more computationally expensive.
Does anyone have any advice on how to attack this? I have looked into
mixed effects, but am not sure it is what I am looking for.
Comments (2)
I think this is more of a model design question than a question about R specifically; as such, I'd like to address the context of the question first and then the appropriate R packages.
If your dependent variable is a probability, e.g., $y\in[0,1]$, a logistic regression is not appropriate for the data, particularly given that you are interested in predicting probabilities out of sample. The logistic model describes the contribution of the independent variables to the probability that a binary dependent variable flips from zero to one; since your variable is continuous and bounded, you need a different specification.
I think your latter intuition about mixed effects is a good one. Since your observations are nested, i.e., US <-> AZ <-> Phoenix, a multi-level model, or in this case a hierarchical linear model, may be the best specification for your data. The best R packages for this type of modeling are multilevel and nlme, and there is an excellent introduction to both multi-level models in R and nlme available here. You may be particularly interested in the discussion of data manipulation for multi-level modeling, which begins on page 26.
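As a rough illustration of the nested specification described above, here is a minimal sketch using nlme. The toy data, the column names (state, city, x, y), and the choice to logit-transform the response before fitting a linear mixed model are all assumptions made for the example, not something taken from the question; the transform also assumes the observed probabilities lie strictly between 0 and 1.

```r
library(nlme)

set.seed(1)
# Hypothetical toy data: 6 states, 5 cities per state, 20 observations per city.
d <- expand.grid(obs = 1:20,
                 city = paste0("c", 1:5),
                 state = paste0("s", 1:6))
# Give every city a label that is unique across states, e.g. "s2.c3".
d$city <- interaction(d$state, d$city, drop = TRUE)
# One non-nested categorical covariate with three levels.
d$x <- factor(sample(c("a", "b", "c"), nrow(d), replace = TRUE))

# Simulate a response on the logit scale with state and city effects.
state_eff <- rnorm(nlevels(d$state), sd = 0.8)[as.integer(d$state)]
city_eff  <- rnorm(nlevels(d$city),  sd = 0.5)[as.integer(d$city)]
eta <- -0.5 + state_eff + city_eff + 0.4 * (d$x == "b") +
  rnorm(nrow(d), sd = 0.3)
d$y <- plogis(eta)  # observed "probability", strictly inside (0, 1)

# Linear mixed model on the logit scale (an assumption of this sketch):
# fixed effect for x, random intercepts for state and for city within state.
fit <- lme(qlogis(y) ~ x, random = ~ 1 | state / city, data = d)

# Predict for a new observation in a city that appears in the training data.
new_obs <- data.frame(
  state = factor("s2", levels = levels(d$state)),
  city  = factor("s2.c3", levels = levels(d$city)),
  x     = factor("b", levels = levels(d$x))
)
# level = 2 uses the state and city effects; back-transform to a probability.
p_hat <- as.numeric(plogis(predict(fit, newdata = new_obs, level = 2)))
p_hat
```

For an observation from a city that was not in the training data, the same call with `level = 1` (state effect only) or `level = 0` (fixed effects only) gives a coarser prediction that does not depend on a city-specific estimate.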
I would suggest looking into penalised regressions like the elastic net. The elastic net is used in text mining, where each column represents the presence or absence of a single word and there may be hundreds of thousands of variables, a problem analogous to yours. A good place to start in R would be the glmnet package and its accompanying JSS paper: http://www.jstatsoft.org/v33/i01/.
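A minimal sketch of the elastic net suggestion with glmnet might look like the following. The toy data and column names are hypothetical, `alpha = 0.5` is an arbitrary mixing value, and fitting a Gaussian model to the logit-transformed response is an assumption of this example rather than something the answer prescribes. The categorical covariates are expanded into a sparse dummy matrix, which is what keeps a factor with thousands of levels manageable.

```r
library(Matrix)
library(glmnet)

set.seed(1)
# Hypothetical toy data: 6 states, 5 cities per state, 20 observations per city.
d <- expand.grid(obs = 1:20,
                 city = paste0("c", 1:5),
                 state = paste0("s", 1:6))
d$city <- interaction(d$state, d$city, drop = TRUE)  # unique city labels
d$x <- factor(sample(c("a", "b", "c"), nrow(d), replace = TRUE))
eta <- -0.5 +
  rnorm(nlevels(d$state), sd = 0.8)[as.integer(d$state)] +
  rnorm(nlevels(d$city),  sd = 0.5)[as.integer(d$city)] +
  0.4 * (d$x == "b") + rnorm(nrow(d), sd = 0.3)
d$y <- plogis(eta)  # observed "probability", strictly inside (0, 1)

# Sparse dummy coding of all categorical covariates.
# The first column is the intercept, which glmnet adds itself, so drop it.
X <- sparse.model.matrix(~ state + city + x, data = d)[, -1]

# Elastic net (alpha between 0 = ridge and 1 = lasso), with the penalty
# strength chosen by cross-validation. The response is modelled on the
# logit scale, which is an assumption of this sketch.
cvfit <- cv.glmnet(X, qlogis(d$y), alpha = 0.5)

# Predict for the first three rows and back-transform to probabilities.
newx <- X[1:3, , drop = FALSE]
p_hat <- as.numeric(plogis(predict(cvfit, newx = newx, s = "lambda.min")))
p_hat
```

The city dummies are collinear with the state dummies because of the nesting; the penalty is what makes the fit well defined despite that.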