Frequency of non-zero or specific numbers in a column

Posted 2024-11-04 10:22:14


My input file:

 x <- read.table(textConnection('
      t0  t1  t2  t3  t4
  aa  0   1   0   1   0
  bb  1   0   1   0   1
  cc  0   0   0   0   0
  dd  1   1   1   0   1
  ee  1   1   1   0   0
  ff  0   0   1   0   1
  gg  -1  -1  -1  -1  0
  hh  -1  1   -1  1   -1
 '), header=TRUE)

I first want to calculate the frequency of non-zero values in each column, i.e.

           t0   t1   t2   t3   t4
frequency  5/8  5/8  6/8  3/8  4/8

Then I want to multiply each column of x by its frequency, to obtain a new matrix as follows:

       t0    t1     t2     t3     t4
  aa   0     5/8    0      3/8    0
  bb   5/8   0      6/8    0      4/8
  cc   0     0      0      0      0
  dd   5/8   5/8    6/8    0      4/8
  ee   5/8   5/8    6/8    0      0
  ff   0     0      6/8    0      4/8
  gg  -5/8  -5/8   -6/8   -3/8    0
  hh  -5/8   5/8   -6/8    3/8   -4/8

How can I do this in R? I learned from the manuals that prop.table(x) can be used to get the overall proportions for the whole table, but how can I do it for each column individually? Any help would be appreciated.


随风而去 2024-11-11 10:22:14


In the same spirit as the answer from @Joris, this is where the wonderful sweep() function comes into its own:

> sweep(x, MARGIN = 2, colMeans(abs(x)), "*")
       t0     t1    t2     t3   t4
aa  0.000  0.625  0.00  0.375  0.0
bb  0.625  0.000  0.75  0.000  0.5
cc  0.000  0.000  0.00  0.000  0.0
dd  0.625  0.625  0.75  0.000  0.5
ee  0.625  0.625  0.75  0.000  0.0
ff  0.000  0.000  0.75  0.000  0.5
gg -0.625 -0.625 -0.75 -0.375  0.0
hh -0.625  0.625 -0.75  0.375 -0.5

What is happening here is that colMeans(abs(x)) is a vector of length 5. We sweep() these values, column-wise (indicated by the MARGIN = 2 in the call), over the data x applying the function * as we go. So, the values in column t0 all get multiplied by colMeans(abs(x))[1], the values in column t1 all get multiplied by colMeans(abs(x))[2] and so on.
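As a rough illustration of what the sweep() call above is doing, the sketch below rebuilds the same result with an explicit loop over the columns and compares the two. The names freq, swept and manual are only illustrative, and x is the data frame from the question, re-created here with read.table(text = ...) for brevity.

```r
# Re-create the question's data; the first column becomes the row names
x <- read.table(text = "
    t0  t1  t2  t3  t4
aa  0   1   0   1   0
bb  1   0   1   0   1
cc  0   0   0   0   0
dd  1   1   1   0   1
ee  1   1   1   0   0
ff  0   0   1   0   1
gg  -1  -1  -1  -1  0
hh  -1  1   -1  1   -1
", header = TRUE)

# One value per column: the mean of the absolute values
freq <- colMeans(abs(x))

# sweep() multiplies every entry of column j by freq[j]
swept <- sweep(x, MARGIN = 2, STATS = freq, FUN = "*")

# The same operation written out column by column
manual <- x
for (j in seq_along(freq)) {
  manual[[j]] <- x[[j]] * freq[j]
}

# Compare the two results as numeric matrices
all.equal(as.matrix(swept), as.matrix(manual))
```

With MARGIN = 1 the statistics would instead be matched against rows, so STATS would need one value per row rather than one per column.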

The advantage of sweep() is that it is very fast when given a matrix:

> means <- colMeans(abs(x))
> X <- data.matrix(x)
> system.time(replicate(1000, sweep(X, 2, means, "*")))
   user  system elapsed 
  0.115   0.000   0.118 
> system.time(replicate(1000, mapply(`*`, x, means)))
   user  system elapsed 
  0.308   0.001   0.309 
> system.time(replicate(1000, mapply(`*`, X, means)))
   user  system elapsed 
  0.204   0.000   0.205

It is much slower when given a data frame:

> system.time(replicate(1000, sweep(x, 2, means, "*")))
   user  system elapsed 
  2.072   0.000   2.074

But that is just the way things are in R.
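If the data frame overhead matters, one possible workaround is to convert to a matrix once, do the sweep there, and convert back at the end. This is only a sketch under the assumption that all columns are numeric; the names X, means, scaled and result are illustrative, and no timings are claimed for it.

```r
# Re-create the question's data; the first column becomes the row names
x <- read.table(text = "
    t0  t1  t2  t3  t4
aa  0   1   0   1   0
bb  1   0   1   0   1
cc  0   0   0   0   0
dd  1   1   1   0   1
ee  1   1   1   0   0
ff  0   0   1   0   1
gg  -1  -1  -1  -1  0
hh  -1  1   -1  1   -1
", header = TRUE)

# Convert once to a numeric matrix (row and column names are kept)
X <- data.matrix(x)
means <- colMeans(abs(X))

# Column-wise multiplication on the matrix
scaled <- sweep(X, 2, means, "*")

# An equivalent without sweep(): matrices are stored column by column,
# so repeating each mean nrow(X) times lines the values up with the columns
scaled2 <- X * rep(means, each = nrow(X))

# Back to a data frame, if one is needed downstream
result <- as.data.frame(scaled)
```

To compare approaches on your own data, wrap each call in system.time(replicate(1000, ...)) as in the timings above.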


橘亓 2024-11-11 10:22:14


Try this:

> colMeans(abs(x))
   t0    t1    t2    t3    t4 
0.625 0.625 0.750 0.375 0.500

for the frequencies and

> mapply(`*`,x,colMeans(abs(x)))
         t0     t1    t2     t3   t4
[1,]  0.000  0.625  0.00  0.375  0.0
[2,]  0.625  0.000  0.75  0.000  0.5
[3,]  0.000  0.000  0.00  0.000  0.0
[4,]  0.625  0.625  0.75  0.000  0.5
[5,]  0.625  0.625  0.75  0.000  0.0
[6,]  0.000  0.000  0.75  0.000  0.5
[7,] -0.625 -0.625 -0.75 -0.375  0.0
[8,] -0.625  0.625 -0.75  0.375 -0.5

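to get the scaled values. mapply applies the function * to every column of x, pairing each column with the corresponding element of colMeans(abs(x)). Note that the result is a matrix, so the row names of x are dropped (hence the [1,], [2,], ... labels above). See also ?mapply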
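Two small variations on this answer, offered as a sketch rather than as the original author's method. First, colMeans(abs(x)) equals the non-zero frequency only because the entries here are -1, 0 and 1; colMeans(x != 0) counts non-zero entries for any values. Second, assigning the output of Map() into scaled[] keeps the data frame class and the row names. The names freq and scaled are illustrative.

```r
# Re-create the question's data; the first column becomes the row names
x <- read.table(text = "
    t0  t1  t2  t3  t4
aa  0   1   0   1   0
bb  1   0   1   0   1
cc  0   0   0   0   0
dd  1   1   1   0   1
ee  1   1   1   0   0
ff  0   0   1   0   1
gg  -1  -1  -1  -1  0
hh  -1  1   -1  1   -1
", header = TRUE)

# Share of non-zero entries in each column (TRUE counts as 1, FALSE as 0)
freq <- colMeans(x != 0)

# Multiply each column by its frequency; assigning into scaled[] replaces
# the columns while keeping the data.frame class and the row names
scaled <- x
scaled[] <- Map(`*`, x, freq)

scaled
```

To get the frequency of one specific value instead, replace x != 0 with a comparison such as x == 1.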
