Cross-tabulation with data.tables in R

Posted on 2025-01-17 13:38:44

Sorry if this question has been asked before. I have been playing with some toy data to learn how to manipulate data.tables. My goal was to go from this data:

library(data.table)

toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                        to=c("B","C","A","D","F","E","E","A","A","A","C",NA))

to arrive at this result:

final_matrix
     L    A    B    C    D    E    F
1:   A    3    1    2 <NA>    1 <NA>
2:   B    1    0 <NA> <NA> <NA> <NA>
3:   C    2 <NA>    0    1    1 <NA>
4:   D <NA> <NA>    1    0 <NA> <NA>
5:   E    1 <NA>    1 <NA>    1    1
6:   F <NA> <NA> <NA> <NA>    1    0
7: tot    7    1    4    1    4    1

(eventually also with zeros instead of NAs, but I got bored). I suppose in Stata this would be an easy cross-tab. I built a function, looped over the unique values in the columns (sigh :/), merged the resulting tables, and then added a final row with the totals. Although I've learned a lot, I wonder what the clean R way to obtain such a cross-tab would be, since the following doesn't work:

table(toy_data$from,toy_data$to)
   
    A B C D E F
  A 3 1 1 0 1 0
  C 1 0 0 1 0 0
  E 0 0 1 0 1 1

Thanks. Here is my function; if you have general improvements or best practices to suggest, I would be super happy:

create_edge_cols<- function(dt,column){
  #this function takes a df and a column, 
  #computes the number of edges among this column and all the other in dt
  #returns a column (list) with the cross-tabulation of columns
  tot_edges_i = dim(dt[from==column|to==column][,.(to=na.omit(to))])[1] # E better! without NAs
  print(tot_edges_i)
  # now tabulate links of column
  tab = data.table(table(unlist(dt[(from==column&to!=column)|
                                           (from!=column&to==column)])))
  setnames(tab, "V1", "L")
  setnames(tab, "N", column)
  setorder(tab,"L")
  tab[L==column,column] = length(dt[to==column & to == from,from])
  #tab[,`:=`(L=L,column=column/as.numeric(tot_edges_i))]
  return(tab)
}

#this should be the first column of our table
first_column = data.table("L"=unique(toy_data[,c(to[!is.na(to)],from)]))

#loop through the values of the columns and merge to a unique df
for (col in sort(unique(toy_data[!is.na(to),c(to,from)]))){
  info_column = copy(create_edge_cols(toy_data,col))
  first_column = merge.data.table(first_column,info_column,all.x = TRUE,all.y = TRUE)
}

## function to set first row as name
header.true <- function(df) {
  names(df) <- as.character(unlist(df[1,]))
  df[-1,]
}
# this should be the last row of our matrix:
last_row = transpose(data.table(table(unlist(toy_data[!is.na(toy_data$to),c(from,to[to!=from])]))))
last_row = cbind(data.table(matrix(c("L","tot"), ncol=1)),last_row)
last_row = header.true(last_row)
last_row

# let's concatenate
final_matrix = rbind(first_column,last_row)
final_matrix

EDIT: a solution suggested by a previous answer that has since been deleted:

library(igraph)
g <- graph_from_data_frame(na.omit(toy_data), directed = FALSE)
am <- as_adjacency_matrix(g, type = "both")
addmargins(as.matrix(am[order(rownames(am)), order(colnames(am))]), 1)

Answered by 赠我空喜 on 2025-01-24 13:38:44

Here is a way. What the question's table() call is missing is factor levels: table() only tabulates values that actually occur in the data. Coerce both columns to factors with the same levels, then assign NA to the counts equal to zero.
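As a small illustration of the point about factor levels, here is a minimal sketch using a made-up vector `x` (not part of the question's data). It only shows that table() reports a level with no observations once that level is declared explicitly.

```r
# Hypothetical example vector; "B" never occurs in it.
x <- c("A", "C", "A")

# Plain character input: only the observed values "A" and "C" are tabulated.
table(x)

# Factor with explicit levels: the unobserved level "B" is kept, with count 0.
f <- factor(x, levels = c("A", "B", "C"))
counts <- table(f)
counts
```

The same mechanism is what makes the cross table square once both columns share one set of levels.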

There is also a printing issue; see the last two statements below. By default, the print method for the S3 class "table" shows NA as a blank. This can be overridden with the na.print argument.

library(data.table)

toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                      to=c("B","C","A","D","F","E","E","A","A","A","C",NA))

levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)
is.na(tbl) <- tbl == 0
tbl
#>     to
#> from  A  B  C  D  E  F
#>    A  3  1  1     1   
#>    B                  
#>    C  1        1      
#>    D                  
#>    E        1     1  1
#>    F

print(tbl, na.print = NA)
#>     to
#> from    A    B    C    D    E    F
#>    A    3    1    1 <NA>    1 <NA>
#>    B <NA> <NA> <NA> <NA> <NA> <NA>
#>    C    1 <NA> <NA>    1 <NA> <NA>
#>    D <NA> <NA> <NA> <NA> <NA> <NA>
#>    E <NA> <NA>    1 <NA>    1    1
#>    F <NA> <NA> <NA> <NA> <NA> <NA>

Created on 2022-03-28 by the reprex package (v2.0.1)

Edit

To add a row of column sums at the bottom of the cross table, rbind the table with its colSums. Note that print(tbl, na.print = NA) is no longer needed: the result is a matrix, so autoprinting calls the matrix print method, which shows NA.

library(data.table)

toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                      to=c("B","C","A","D","F","E","E","A","A","A","C",NA))

levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)

class(tbl)  # check the output object class
#> [1] "table"

tbl <- rbind(tbl, tot = colSums(tbl, na.rm = TRUE))
is.na(tbl) <- tbl == 0

class(tbl)  # check the output object class, it's no longer "table"
#> [1] "matrix" "array"

tbl
#>      A  B  C  D  E  F
#> A    3  1  1 NA  1 NA
#> B   NA NA NA NA NA NA
#> C    1 NA NA  1 NA NA
#> D   NA NA NA NA NA NA
#> E   NA NA  1 NA  1  1
#> F   NA NA NA NA NA NA
#> tot  4  1  2  1  2  1

Created on 2022-03-29 by the reprex package (v2.0.1)
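The tables above count directed from → to pairs, whereas the target in the question appears to treat each pair as undirected (for example, A–C is 2 there, and the totals are 7, 1, 4, 1, 4, 1). A minimal sketch of one way to get that, assuming self-pairs such as A → A should be counted once: add the directed table to its transpose and restore the diagonal. It uses plain vectors and base R only; the names `lv`, `directed` and `sym` are introduced here for illustration.

```r
from <- c("A","A","A","C","E","E","A","A","A","C","E","E")
to   <- c("B","C","A","D","F","E","E","A","A","A","C",NA)

# Shared levels make the table square; sort() drops the NA by default.
lv <- sort(unique(c(from, to)))

# Directed counts as a plain matrix. table() omits the pair whose
# `to` is NA by default.
directed <- unclass(table(from = factor(from, lv), to = factor(to, lv)))

# Undirected counts: add the transpose, then restore the diagonal so
# that self-pairs (e.g. A -> A) are counted once rather than twice.
sym <- directed + t(directed)
diag(sym) <- diag(directed)

# Append the column totals as a final row named "tot".
final_matrix <- rbind(sym, tot = colSums(sym))
final_matrix
```

This keeps zeros rather than NAs, which the question mentions as the eventual goal. If the NA display is preferred, `is.na(final_matrix) <- final_matrix == 0` can be applied afterwards, as in the answer above.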
