How do I get a contingency table?

Posted on 2024-12-04 15:38:21

I am trying to create a contingency table from a particular type of data. This would be doable with loops etc... but because my final table would contain more than 10E5 cells, I am looking for a pre-existing function.

My initial data are as follows:

PLANT                  ANIMAL                          INTERACTIONS
---------------------- ------------------------------- ------------
Tragopogon_pratensis   Propylea_quatuordecimpunctata         1
Anthriscus_sylvestris  Rhagonycha_nigriventris               3
Anthriscus_sylvestris  Sarcophaga_carnaria                   2
Heracleum_sphondylium  Sarcophaga_carnaria                   1
Anthriscus_sylvestris  Sarcophaga_variegata                  4
Anthriscus_sylvestris  Sphaerophoria_interrupta_Gruppe       3
Cerastium_holosteoides Sphaerophoria_interrupta_Gruppe       1

I would like to create a table like this:

                       Propylea_quatuordecimpunctata Rhagonycha_nigriventris Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
---------------------- ----------------------------- ----------------------- ------------------- -------------------- -------------------------------
Tragopogon_pratensis   1                             0                       0                   0                    0
Anthriscus_sylvestris  0                             3                       2                   4                    3
Heracleum_sphondylium  0                             0                       1                   0                    0
Cerastium_holosteoides 0                             0                       0                   0                    1

That is, all plant species in rows, all animal species in columns, with 0 wherever a pair has no interactions (my initial data only list the interactions that occur).

无悔心 2024-12-11 15:38:21

In base R, use table or xtabs:

with(warpbreaks, table(wool, tension))

    tension
wool L M H
   A 9 9 9
   B 9 9 9

xtabs(~wool+tension, data=warpbreaks)

    tension
wool L M H
   A 9 9 9
   B 9 9 9

The gmodels package has a function CrossTable that gives output similar to what users of SPSS or SAS expect:

library(gmodels)
with(warpbreaks, CrossTable(wool, tension))


   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  54 


             | tension 
        wool |         L |         M |         H | Row Total | 
-------------|-----------|-----------|-----------|-----------|
           A |         9 |         9 |         9 |        27 | 
             |     0.000 |     0.000 |     0.000 |           | 
             |     0.333 |     0.333 |     0.333 |     0.500 | 
             |     0.500 |     0.500 |     0.500 |           | 
             |     0.167 |     0.167 |     0.167 |           | 
-------------|-----------|-----------|-----------|-----------|
           B |         9 |         9 |         9 |        27 | 
             |     0.000 |     0.000 |     0.000 |           | 
             |     0.333 |     0.333 |     0.333 |     0.500 | 
             |     0.500 |     0.500 |     0.500 |           | 
             |     0.167 |     0.167 |     0.167 |           | 
-------------|-----------|-----------|-----------|-----------|
Column Total |        18 |        18 |        18 |        54 | 
             |     0.333 |     0.333 |     0.333 |           | 
-------------|-----------|-----------|-----------|-----------|

掩耳倾听 2024-12-11 15:38:21

The reshape package should do the trick.

> library(reshape)

> df <- data.frame(PLANT = c("Tragopogon_pratensis","Anthriscus_sylvestris","Anthriscus_sylvestris","Heracleum_sphondylium","Anthriscus_sylvestris","Anthriscus_sylvestris","Cerastium_holosteoides"),
                   ANIMAL= c("Propylea_quatuordecimpunctata","Rhagonycha_nigriventris","Sarcophaga_carnaria","Sarcophaga_carnaria","Sarcophaga_variegata","Sphaerophoria_interrupta_Gruppe","Sphaerophoria_interrupta_Gruppe"),
                   INTERACTIONS = c(1,3,2,1,4,3,1),
                   stringsAsFactors=FALSE)

> df <- melt(df,id.vars=c("PLANT","ANIMAL"))    
> df <- cast(df,formula=PLANT~ANIMAL)
> df <- replace(df,is.na(df),0)

> df
                   PLANT Propylea_quatuordecimpunctata Rhagonycha_nigriventris
1  Anthriscus_sylvestris                             0                       3
2 Cerastium_holosteoides                             0                       0
3  Heracleum_sphondylium                             0                       0
4   Tragopogon_pratensis                             1                       0
  Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
1                   2                    4                               3
2                   0                    0                               1
3                   1                    0                               0
4                   0                    0                               0

I'm still figuring out how to fix the row order (the plants come out sorted alphabetically rather than in their original order). Any suggestions?
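
One possible way to keep the original order, sketched here with base R's xtabs rather than the reshape calls above: turn PLANT and ANIMAL into factors whose levels follow the order of first appearance. The factor() step is an assumption about what "order" means here (first appearance in the data), and the object name tab is only illustrative.

```r
# Same example data as above
df <- data.frame(
  PLANT = c("Tragopogon_pratensis", "Anthriscus_sylvestris", "Anthriscus_sylvestris",
            "Heracleum_sphondylium", "Anthriscus_sylvestris", "Anthriscus_sylvestris",
            "Cerastium_holosteoides"),
  ANIMAL = c("Propylea_quatuordecimpunctata", "Rhagonycha_nigriventris",
             "Sarcophaga_carnaria", "Sarcophaga_carnaria", "Sarcophaga_variegata",
             "Sphaerophoria_interrupta_Gruppe", "Sphaerophoria_interrupta_Gruppe"),
  INTERACTIONS = c(1, 3, 2, 1, 4, 3, 1),
  stringsAsFactors = FALSE
)

# Fix the level order to the order of first appearance instead of alphabetical
df$PLANT  <- factor(df$PLANT,  levels = unique(df$PLANT))
df$ANIMAL <- factor(df$ANIMAL, levels = unique(df$ANIMAL))

# Sum INTERACTIONS per PLANT x ANIMAL pair; pairs that never occur become 0
tab <- xtabs(INTERACTIONS ~ PLANT + ANIMAL, data = df)
tab
```

The rows and columns of tab follow the factor levels, so Tragopogon_pratensis comes first as in the question.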

酷到爆炸 2024-12-11 15:38:21

I'd like to point out that we can get the same results Andrie posted without using with():

Base R:

# 3 options
table(warpbreaks[, 2:3])
table(warpbreaks[, c("wool", "tension")])
table(warpbreaks$wool, warpbreaks$tension, dnn = c("wool", "tension"))

    tension
wool L M H
   A 9 9 9
   B 9 9 9

Package gmodels:

library(gmodels)
# 2 options    
CrossTable(warpbreaks$wool, warpbreaks$tension)
CrossTable(warpbreaks$wool, warpbreaks$tension, dnn = c("Wool", "Tension"))


   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  54 


                | warpbreaks$tension 
warpbreaks$wool |         L |         M |         H | Row Total | 
----------------|-----------|-----------|-----------|-----------|
              A |         9 |         9 |         9 |        27 | 
                |     0.000 |     0.000 |     0.000 |           | 
                |     0.333 |     0.333 |     0.333 |     0.500 | 
                |     0.500 |     0.500 |     0.500 |           | 
                |     0.167 |     0.167 |     0.167 |           | 
----------------|-----------|-----------|-----------|-----------|
              B |         9 |         9 |         9 |        27 | 
                |     0.000 |     0.000 |     0.000 |           | 
                |     0.333 |     0.333 |     0.333 |     0.500 | 
                |     0.500 |     0.500 |     0.500 |           | 
                |     0.167 |     0.167 |     0.167 |           | 
----------------|-----------|-----------|-----------|-----------|
   Column Total |        18 |        18 |        18 |        54 | 
                |     0.333 |     0.333 |     0.333 |           | 
----------------|-----------|-----------|-----------|-----------|

帅的被狗咬 2024-12-11 15:38:21

xtabs in base R should work, for example:

dat <- data.frame(PLANT = c("p1", "p2", "p2", "p4", "p5", "p5", "p6"),
                  ANIMAL = c("a1", "a2", "a3", "a3", "a4", "a5", "a5"),
                  INTERACTIONS = c(1,3,2,1,4,3,1),
                  stringsAsFactors = FALSE)

(x2.table <- xtabs(dat$INTERACTIONS ~ dat$PLANT + dat$ANIMAL))

     dat$ANIMAL
dat$PLANT a1 a2 a3 a4 a5
       p1  1  0  0  0  0
       p2  0  3  2  0  0
       p4  0  0  1  0  0
       p5  0  0  0  4  3
       p6  0  0  0  0  1

chisq.test(x2.table, simulate.p.value = TRUE)

I think that should do what you're looking for fairly easily. I'm not sure how efficiently it scales to a contingency table with more than 10E5 cells, and whether a chi-squared test is meaningful on a table that large is a separate statistical question.
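
On the size concern: since most plant/animal pairs never interact, a sparse result may help. A minimal sketch, assuming the Matrix package is available (it ships with standard R installations as a recommended package); sparse = TRUE is a documented argument of xtabs for two-way tables, and the name sp is only illustrative.

```r
library(Matrix)  # provides the sparse matrix classes and their methods

dat <- data.frame(PLANT = c("p1", "p2", "p2", "p4", "p5", "p5", "p6"),
                  ANIMAL = c("a1", "a2", "a3", "a3", "a4", "a5", "a5"),
                  INTERACTIONS = c(1, 3, 2, 1, 4, 3, 1),
                  stringsAsFactors = FALSE)

# With sparse = TRUE, xtabs returns a sparse Matrix object that stores
# only the non-zero cells instead of a dense table
sp <- xtabs(INTERACTIONS ~ PLANT + ANIMAL, data = dat, sparse = TRUE)
sp

# Convert to an ordinary dense matrix only if it fits in memory
dense <- as.matrix(sp)
```

Passing data = dat also gives plain PLANT and ANIMAL dimension names instead of dat$PLANT and dat$ANIMAL.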

银河中√捞星星 2024-12-11 15:38:21

With dplyr / tidyr:

df <- read.table(text='PLANT                  ANIMAL                          INTERACTIONS
                 Tragopogon_pratensis   Propylea_quatuordecimpunctata         1
                 Anthriscus_sylvestris  Rhagonycha_nigriventris               3
                 Anthriscus_sylvestris  Sarcophaga_carnaria                   2
                 Heracleum_sphondylium  Sarcophaga_carnaria                   1
                 Anthriscus_sylvestris  Sarcophaga_variegata                  4
                 Anthriscus_sylvestris  Sphaerophoria_interrupta_Gruppe       3
                 Cerastium_holosteoides Sphaerophoria_interrupta_Gruppe       1', header=TRUE)
library(dplyr)
library(tidyr)
df %>% spread(ANIMAL, INTERACTIONS, fill=0)

#                    PLANT Propylea_quatuordecimpunctata Rhagonycha_nigriventris Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
# 1  Anthriscus_sylvestris                             0                       3                   2                    4                               3
# 2 Cerastium_holosteoides                             0                       0                   0                    0                               1
# 3  Heracleum_sphondylium                             0                       0                   1                    0                               0
# 4   Tragopogon_pratensis                             1                       0                   0                    0                               0
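
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A rough equivalent of the call above, assuming a recent tidyr is installed; the shortened data frame and the name wide are only for illustration.

```r
library(tidyr)

df <- data.frame(
  PLANT = c("Tragopogon_pratensis", "Anthriscus_sylvestris", "Anthriscus_sylvestris",
            "Heracleum_sphondylium", "Anthriscus_sylvestris", "Anthriscus_sylvestris",
            "Cerastium_holosteoides"),
  ANIMAL = c("Propylea_quatuordecimpunctata", "Rhagonycha_nigriventris",
             "Sarcophaga_carnaria", "Sarcophaga_carnaria", "Sarcophaga_variegata",
             "Sphaerophoria_interrupta_Gruppe", "Sphaerophoria_interrupta_Gruppe"),
  INTERACTIONS = c(1, 3, 2, 1, 4, 3, 1),
  stringsAsFactors = FALSE
)

# One column per ANIMAL, cell values taken from INTERACTIONS,
# and 0 for plant/animal pairs that do not appear in the data
wide <- pivot_wider(df,
                    names_from = ANIMAL,
                    values_from = INTERACTIONS,
                    values_fill = 0)
wide
```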

逆夏时光 2024-12-11 15:38:21

Simply use the dcast() function from the "reshape2" package:

library(reshape2)
ans <- dcast(df, PLANT ~ ANIMAL, value.var = "INTERACTIONS", fill = 0)

Here "PLANT" goes in the left column and the "ANIMAL" values form the top row; the cells are filled from "INTERACTIONS", and combinations that do not occur in the data are filled with 0.
