Error using h2o.glm with random effects in R
I would like to use h2o in R for glm regression with random effects (HGLM, which seems possible from this page). I have not managed to make it work yet, and I get errors I do not understand.
Here is my working example: I define a dataset exhibiting Simpson's paradox: a global increasing trend, but a decreasing trend within each group
library(tidyverse)
library(ggplot2)
library(h2o)
library(data.table)
global_slope <- 1
global_int <- 1
Npoints_per_group <- 50
N_groups <- 10
pentes <- rnorm(N_groups,-1,.5)
centers_x <- seq(0,10,length = N_groups)
center_y <- global_slope*centers_x + global_int
group_spread <- 2
group_names <- sample(LETTERS,N_groups)
df <- lapply(1:N_groups,function(i){
x <- seq(centers_x[i]-group_spread/2,centers_x[i]+group_spread/2,length = Npoints_per_group)
y <- pentes[i]*(x- centers_x[i])+center_y[i]+rnorm(Npoints_per_group)
data.table(x = x,y = y,ID = group_names[i])
}) %>% rbindlist()
You can recognize something similar to the example on the Wikipedia page for Simpson's paradox:
ggplot(df,aes(x,y,color = as.factor(ID)))+
geom_point()
A linear regression without random effects sees the increasing trend:
lm(y~x,data = df) %>%
summary()
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.28187 0.13077 9.803 <2e-16 ***
x 0.94147 0.02194 42.917 <2e-16 ***
A standard multilevel regression looks like this:
library(lme4)
library(lmerTest)
lmer( y ~ x + (1+x|ID) ,data = df) %>%
summary()
It properly estimates the decreasing trend:
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 11.7192 2.6218 8.8220 4.470 0.001634 **
x -1.0418 0.1959 8.9808 -5.318 0.000486 ***
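For reference, the per-group estimates behind those fixed effects can be inspected directly. This is a sketch under the assumption that the model above is stored in a variable (here `fit`); `ranef()` and `coef()` are standard lme4 accessors:

```r
library(lme4)
library(lmerTest)

# Keep the fitted model instead of piping it straight into summary()
fit <- lmer(y ~ x + (1 + x | ID), data = df)

# Per-group deviations from the fixed intercept and slope
ranef(fit)$ID

# Combined (fixed + random) intercept and slope for each group;
# the slope column should sit near the group slopes drawn from rnorm(N_groups, -1, .5)
coef(fit)$ID
```

This is what h2o's HGLM would need to reproduce: a separate intercept and slope per level of `ID`, shrunk toward the fixed effects.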
Now I test with h2o:
library(h2o)
h2o.init()
df2 <- as.h2o(df)
test_glm <- h2o.glm(family = "gaussian",
x = "x",
y = "y",
training_frame = df2,
lambda = 0,
compute_p_values = TRUE)
test_glm
And it works well, similar to the linear model above:
Coefficients: glm coefficients
names coefficients std_error z_value p_value standardized_coefficients
1 Intercept 1.281868 0.130766 9.802785 0.000000 5.989232
2 x 0.941473 0.021937 42.916536 0.000000 3.058444
But when I want to use random effects:
test_glm2 <- h2o.glm(family = "gaussian",
x = "x",
y = "y",
training_frame = df2,
random_columns = "ID",
lambda = 0,
compute_p_values = TRUE)
I got
Error in .h2o.checkAndUnifyModelParameters(algo = algo, allParams = ALL_PARAMS, : vector of random_columns must be of type numeric, but got character.
The error persists even if I force df2$ID <- as.numeric(df2$ID).
What am I doing wrong? What is the proper way to get something similar to the mixed-effects model fitted with lmer
(i.e. random slope and intercept)?
EDIT
As suggested by Erin LeDell, I changed the code to use the column index. I now get a different error, which I do not understand either:
df2$ID <- as.factor(df2$ID)
test_glm2 <- h2o.glm(family = "gaussian",
x = "x",
y = "y",
training_frame = df2,
random_columns = c(3),
HGLM = TRUE,
lambda = 0,
compute_p_values = TRUE)
DistributedException from localhost/127.0.0.1:54321: 'null', caused by java.lang.NullPointerException
DistributedException from localhost/127.0.0.1:54321: 'null', caused by java.lang.NullPointerException
at water.MRTask.getResult(MRTask.java:660)
at water.MRTask.getResult(MRTask.java:670)
at water.MRTask.doAll(MRTask.java:530)
at water.MRTask.doAll(MRTask.java:482)
at hex.glm.GLM$GLMDriver.fitCoeffs(GLM.java:1334)
at hex.glm.GLM$GLMDriver.fitHGLM(GLM.java:1505)
at hex.glm.GLM$GLMDriver.fitModel(GLM.java:2060)
at hex.glm.GLM$GLMDriver.computeSubmodel(GLM.java:2526)
at hex.glm.GLM$GLMDriver.doCompute(GLM.java:2664)
at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:2561)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:247)
at hex.glm.GLM$GLMDriver.compute2(GLM.java:1188)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1658)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Edit 2:
I actually found a way to remove the above error, by adding
rand_link = c("identity"),
rand_family = c("gaussian"),
to the h2o.glm arguments:
h2o.glm(family = "gaussian",
rand_link = c("identity"),
rand_family = c("gaussian"),
# compute_p_values = TRUE,
x = "x",
y = "y",
training_frame = df2,
random_columns = c(3),
HGLM = TRUE,
lambda = 0)
This works. But when I set compute_p_values = TRUE, I get a new error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
degrees of freedom (0)
There are a few things wrong with the code (we need to do a better job of documenting the random_columns parameter). Currently, the random_columns parameter only supports column indices (not column names), and I created a JIRA to improve this. The error is not actually saying that the column has to be numeric; in fact, it needs to be a factor. Lastly, you need to set HGLM = TRUE. To fix your code above, you can do:
EDIT: This still causes a bug, so I filed a bug report here.
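For completeness, a sketch of the call the answer points at, assembled from the answer's three points plus the rand_family/rand_link arguments the question's second edit found necessary. This is the questioner's own working variant rather than an official recipe, and the HGLM path may still hit the bug mentioned above:

```r
library(h2o)
h2o.init()

df2 <- as.h2o(df)
df2$ID <- as.factor(df2$ID)  # the random column must be a factor, not numeric

test_glm2 <- h2o.glm(family = "gaussian",
                     rand_family = c("gaussian"),
                     rand_link = c("identity"),
                     x = "x",
                     y = "y",
                     training_frame = df2,
                     random_columns = c(3),  # column index of ID (names not supported)
                     HGLM = TRUE,
                     lambda = 0)  # compute_p_values = TRUE still errors here
```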