Error using h2o.glm with random effects in R
I would like to use h2o in R for glm regression with random effects (HGLM, which seems possible from this page). I have not managed to make it work yet, and I get errors I do not understand.
Here is my working example: I define a dataset exhibiting Simpson's paradox: a global increasing trend, but a decreasing trend within each group
library(tidyverse)
library(ggplot2)
library(h2o)
library(data.table)
global_slope <- 1
global_int <- 1
Npoints_per_group <- 50
N_groups <- 10
pentes <- rnorm(N_groups,-1,.5)
centers_x <- seq(0,10,length = N_groups)
center_y <- global_slope*centers_x + global_int
group_spread <- 2
group_names <- sample(LETTERS,N_groups)
df <- lapply(1:N_groups,function(i){
x <- seq(centers_x[i]-group_spread/2,centers_x[i]+group_spread/2,length = Npoints_per_group)
y <- pentes[i]*(x- centers_x[i])+center_y[i]+rnorm(Npoints_per_group)
data.table(x = x,y = y,ID = group_names[i])
}) %>% rbindlist()
You can recognize something similar to the example on the Wikipedia page for Simpson's paradox:
ggplot(df,aes(x,y,color = as.factor(ID)))+
geom_point()
A linear regression without random effects sees the increasing trend:
lm(y~x,data = df) %>%
summary()
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.28187 0.13077 9.803 <2e-16 ***
x 0.94147 0.02194 42.917 <2e-16 ***
A standard multilevel regression looks like this:
library(lme4)
library(lmerTest)
lmer( y ~ x + (1+x|ID) ,data = df) %>%
summary()
It properly estimates the decreasing trend:
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 11.7192 2.6218 8.8220 4.470 0.001634 **
x -1.0418 0.1959 8.9808 -5.318 0.000486 ***
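For reference, the per-group estimates behind those fixed effects can be inspected directly. This is a sketch under the assumption that the model above is stored in a variable (here `fit`); `ranef()` and `coef()` are standard lme4 accessors:

```r
library(lme4)
library(lmerTest)

# Keep the fitted model instead of piping it straight into summary()
fit <- lmer(y ~ x + (1 + x | ID), data = df)

# Per-group deviations from the fixed intercept and slope
ranef(fit)$ID

# Combined (fixed + random) intercept and slope for each group;
# the slope column should sit near the group slopes drawn from rnorm(N_groups, -1, .5)
coef(fit)$ID
```

This is what h2o's HGLM would need to reproduce: a separate intercept and slope per level of `ID`, shrunk toward the fixed effects.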
Now I test with h2o:
library(h2o)
h2o.init()
df2 <- as.h2o(df)
test_glm <- h2o.glm(family = "gaussian",
x = "x",
y = "y",
training_frame = df2,
lambda = 0,
compute_p_values = TRUE)
test_glm
And it works well, similar to the linear model above:
Coefficients: glm coefficients
names coefficients std_error z_value p_value standardized_coefficients
1 Intercept 1.281868 0.130766 9.802785 0.000000 5.989232
2 x 0.941473 0.021937 42.916536 0.000000 3.058444
But when I want to use random effects:
test_glm2 <- h2o.glm(family = "gaussian",
x = "x",
y = "y",
training_frame = df2,
random_columns = "ID",
lambda = 0,
compute_p_values = TRUE)
I got
Error in .h2o.checkAndUnifyModelParameters(algo = algo, allParams = ALL_PARAMS, : vector of random_columns must be of type numeric, but got character.
The error persists even if I force df2$ID <- as.numeric(df2$ID).
What am I doing wrong? What is the proper way to get something similar to the mixed-effects model fitted with lmer
(i.e. random slope and intercept)?
EDIT
As suggested by Erin LeDell, I changed the code to use the column index. I now get a different error, which I do not understand either:
df2$ID <- as.factor(df2$ID)
test_glm2 <- h2o.glm(family = "gaussian",
x = "x",
y = "y",
training_frame = df2,
random_columns = c(3),
HGLM = TRUE,
lambda = 0,
compute_p_values = TRUE)
DistributedException from localhost/127.0.0.1:54321: 'null', caused by java.lang.NullPointerException
DistributedException from localhost/127.0.0.1:54321: 'null', caused by java.lang.NullPointerException
at water.MRTask.getResult(MRTask.java:660)
at water.MRTask.getResult(MRTask.java:670)
at water.MRTask.doAll(MRTask.java:530)
at water.MRTask.doAll(MRTask.java:482)
at hex.glm.GLM$GLMDriver.fitCoeffs(GLM.java:1334)
at hex.glm.GLM$GLMDriver.fitHGLM(GLM.java:1505)
at hex.glm.GLM$GLMDriver.fitModel(GLM.java:2060)
at hex.glm.GLM$GLMDriver.computeSubmodel(GLM.java:2526)
at hex.glm.GLM$GLMDriver.doCompute(GLM.java:2664)
at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:2561)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:247)
at hex.glm.GLM$GLMDriver.compute2(GLM.java:1188)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1658)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Edit 2:
I actually found a way to remove the above error, by adding
rand_link = c("identity"),
rand_family = c("gaussian"),
to the h2o.glm arguments:
h2o.glm(family = "gaussian",
rand_link = c("identity"),
rand_family = c("gaussian"),
# compute_p_values = TRUE,
x = "x",
y = "y",
training_frame = df2,
random_columns = c(3),
HGLM = TRUE,
lambda = 0)
This works. But when I set compute_p_values = TRUE, I get a new error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
degrees of freedom (0)
There are a few things wrong with the code (we need to do a better job of documenting the random_columns parameter). Currently, the random_columns parameter only supports column indices (not column names), and I created a JIRA to improve this. The error is not actually saying that the column has to be numeric; in fact, it needs to be a factor. Lastly, you need to set HGLM = TRUE. To fix your code above, you can do:
EDIT: This still causes a bug, so I filed a bug report here.
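For completeness, a sketch of the call the answer points at, assembled from the answer's three points plus the rand_family/rand_link arguments the question's second edit found necessary. This is the questioner's own working variant rather than an official recipe, and the HGLM path may still hit the bug mentioned above:

```r
library(h2o)
h2o.init()

df2 <- as.h2o(df)
df2$ID <- as.factor(df2$ID)  # the random column must be a factor, not numeric

test_glm2 <- h2o.glm(family = "gaussian",
                     rand_family = c("gaussian"),
                     rand_link = c("identity"),
                     x = "x",
                     y = "y",
                     training_frame = df2,
                     random_columns = c(3),  # column index of ID (names not supported)
                     HGLM = TRUE,
                     lambda = 0)  # compute_p_values = TRUE still errors here
```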