How do I compute an empirical CDF in R?

Posted on 2024-09-30 19:38:46, 587 words, 6 views, 0 comments

I'm reading a sparse table from a file which looks like:

1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1  0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1

Note row lengths are different.

Each row represents a single simulation. The value in the i-th column in each row says how many times value i-1 was observed in this simulation. For example, in the first simulation (first row), we got a single result with value '0' (first column), 7 results with value '2' (third column) etc.

I wish to create an average cumulative distribution function (CDF) over all the simulation results, so that I can later use it to calculate an empirical p-value for true results.

To do this I can first sum up each column, but I need to treat the undefined columns (those missing from the shorter rows) as zeros.

How do I read such a table with different row lengths? How do I sum up the columns while replacing 'undef' values with '0'? And finally, how do I create the CDF? (I can do this manually, but I guess there is some package that can do it.)

九八野马 2024-10-07 19:38:46

This will read the data in:

dat <- textConnection("1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1  0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1")
df <- data.frame(scan(dat, fill = TRUE, what = as.list(rep(1, 29))))
names(df) <- paste("Val", 1:29)
close(dat)

Resulting in:

> head(df)
  Val 1 Val 2 Val 3 Val 4 Val 5 Val 6 Val 7 Val 8 Val 9 Val 10 Val 11 Val 12
1     1     0     7     0     0     1     0     0     0      5      0      0
2     1     0     0     1     0     0     0     3     0      0      0      0
3     0     0     0     1     0     0     0     2     0      0      0      0
4     1     0     0     1     0     3     0     0     0      0      1      0
5     0     0     0     1     0     0     0     2     0      0      0      0
....

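If the data are in a file, provide the file name instead of dat. This code presumes that there are a maximum of 29 columns, as per the data you supplied. Alter the 29 to suit the real data.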
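If the maximum number of columns is not known in advance, it can be counted from the file before reading. The following is a minimal sketch of that idea, not part of the original answer: it uses count.fields() and read.table() instead of scan(), and the temporary file, its toy contents, and the names path, n_col and sim are all assumptions made for illustration.

```r
# Hypothetical input: a temporary file stands in for the real data file.
path <- tempfile(fileext = ".txt")
writeLines(c("1 0 7 0 0 1",
             "1 0 0 1",
             "0 0 0 1 0 0 0 2"), path)

# Count the whitespace-separated fields on each line and take the widest row.
n_col <- max(count.fields(path))

# fill = TRUE pads short rows with NA; col.names fixes the number of columns.
# Column i is named after the value it counts, i - 1.
sim <- read.table(path, header = FALSE, fill = TRUE,
                  col.names = paste0("Val", seq_len(n_col) - 1))

# Turn the padded NA cells into 0, i.e. "value not observed in this simulation".
sim[is.na(sim)] <- 0
```

With the NA cells already set to 0, colSums(sim) can be called without na.rm = TRUE.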

We get the column sums using colSums(), where na.rm = TRUE makes the padded NA cells count as zero,

df.csum <- colSums(df, na.rm = TRUE)

generate the ECDF that you wanted using the ecdf() function,

df.ecdf <- ecdf(df.csum)

and plot it using the plot() method:

plot(df.ecdf, verticals = TRUE)
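
One caveat: ecdf(df.csum) treats the column totals themselves as the sample. If what is wanted is the CDF of the simulated values (0, 1, 2, ...), which is one possible reading of the question, the totals can be used as weights instead. The sketch below shows that alternative under stated assumptions: counts is a made-up stand-in for df.csum, and obs is a hypothetical observed value; pooling all simulations this way weights each simulation by its number of results.

```r
# Toy stand-in for df.csum: pooled count for each value 0..5 (made-up numbers).
counts <- c(3, 0, 7, 2, 0, 1)
values <- seq_along(counts) - 1   # column i holds the count for value i - 1

# Cumulative share of all simulated results at or below each value.
cdf <- cumsum(counts) / sum(counts)

# The same CDF as a step function, built by expanding counts into raw values.
Fn <- ecdf(rep(values, times = counts))

# Upper-tail empirical p-value for a hypothetical observed value:
# the share of simulated results that are at least as large as obs.
obs <- 3
p_upper <- sum(counts[values >= obs]) / sum(counts)
```

Here Fn(values) agrees with cdf, and plot(Fn, verticals = TRUE) draws it in the same way as above. Whether an upper-tail, lower-tail, or two-sided p-value is appropriate depends on the test being run.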
