Finding row medians and median absolute deviations of a matrix

Posted on 2024-11-10 06:23:06

I have a data frame with 22239 rows and 200 columns. The first column, NAME, is character and the other columns are numeric. My goal is to transform every element of each row by:

  • Finding the row's median;
  • Subtracting the median from each element of the row;
  • Finding the row's median absolute deviation (mad);
  • Dividing the centered elements by the row's mad.

I tried it this way:

edata <- read.delim("a.txt", header=TRUE, sep="\t")

## Converting dataframe into Matrix
## Taking all rows but starting from 2 column to 200
data <- as.matrix(edata[,2:200])
for(i in seq_len(nrow(data))){  # loop over rows
    m <- median(data[i,]) # median of row i, taken before the row is overwritten
    md <- mad(data[i,]) # mad of row i
    for(j in seq_len(ncol(data))) {  # loop over the 199 numeric columns
        a <- data[i,j]  # assigning matrix element value to a
        subs <- a-m    # subtracting the median
        escore <- subs/md  # final score
        data[i,j] <- escore  # assigning final score to row elements
    }
}

After getting the new values for every element of the rows, I want to sort them according to the 75% quantiles, on the basis of the NAME column. But I am not sure how to do this.

I know my code isn't memory efficient. When I run the code above, the looping is very slow. I tried foreach, but couldn't get it to work. Can you suggest a good way to deal with this kind of problem?


Comments (5)

等你爱我 2024-11-17 06:23:07

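You can put all the steps in one function and use only one apply loop.

rfun <- function(x) {
         me <- median(x)
         md <- mad(x, center=me, constant=1)  # constant=1 gives the unscaled mad
         return((x-me)/md)}

dat_s <- t(apply(dat,1,rfun))  # apply() returns each row's result as a column, so transpose back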
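
One detail worth checking: when the function passed to apply() returns a vector, apply() stores each result as a column, so a row-wise call comes back transposed. A minimal sketch on a hypothetical 3 x 4 matrix (the names `m`, `center_scale` and `out` are illustrative, not taken from the answer):

```r
# Hypothetical small matrix, used only for illustration
m <- matrix(c(1, 2, 4, 7,
              3, 3, 5, 9,
              0, 2, 6, 8), nrow = 3, byrow = TRUE)

# Same idea as rfun: centre on the row median, scale by the unscaled mad
center_scale <- function(x) {
  me <- median(x)
  (x - me) / mad(x, center = me, constant = 1)
}

out <- apply(m, 1, center_scale)
dim(out)        # 4 3: one column per original row
out <- t(out)   # transpose back to the 3 x 4 layout of m
dim(out)        # 3 4
```

After the transpose, row i of `out` holds the scores for row i of `m`.
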
半衾梦 2024-11-17 06:23:06

This is an ideal job for sweep().

set.seed(47)
dat <- matrix(rnorm(22239 * 200), ncol = 200)
rmeds <- apply(dat, 1, median)     ## row medians
rmads <- apply(dat, 1, mad)        ## row mads
dat2 <- sweep(dat, 1, rmeds, "-")  ## sweep out the medians
dat2 <- sweep(dat2, 1, rmads, "/") ## sweep out the mads

This can be sped up a bit by not using mad(), as it computes the medians again:

rmeds <- apply(dat, 1, median)     ## row medians
dat3 <- sweep(dat, 1, rmeds, "-")  ## sweep out the medians
rmads <- 1.4826 * apply(abs(dat3), 1, median)        ## row mads
dat3 <- sweep(dat3, 1, rmads, "/") ## sweep out the mads

R> all.equal(dat2, dat3)
[1] TRUE

Notice that R's mad() multiplies by a constant 1.4826 to achieve asymptotically normal consistency, hence the extra bit in the second example.
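
The role of the constant can be checked directly on a single vector. This is only a small sketch with made-up names (`x`, `by_hand`); it compares mad() against the median of absolute deviations computed by hand:

```r
set.seed(47)
x <- rnorm(200)

# Raw median absolute deviation, computed by hand
by_hand <- median(abs(x - median(x)))

all.equal(mad(x), 1.4826 * by_hand)        # TRUE: the default is scaled by 1.4826
all.equal(mad(x, constant = 1), by_hand)   # TRUE: constant = 1 gives the raw value
```

A function that calls mad() with constant = 1 will therefore produce scores that differ from the sweep() versions by that factor.
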

Some timings on my system:

## first version
   user  system elapsed 
  6.215   0.183   6.412 

## second version
   user  system elapsed 
  4.365   0.167   4.535 

For @Nick's Answer I get:

## @Nick's Version
   user  system elapsed 
  5.900   0.032   5.955

which is consistently faster than my first version, but a little slower than the second version, again because the medians are being computed twice.

夜司空 2024-11-17 06:23:06

How about this:
(I created another matrix to start from, but the method is the same)

dta<-matrix(rnorm(200), nrow=20)
dta.perrow<-apply(dta, 1, function(currow){c(med=median(currow), mad=mad(currow))})
result<-(dta - dta.perrow[1,])/dta.perrow[2,]

I'm sure there are still better ways, but HTH.
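
The plain subtraction works without sweep() because R stores matrices column by column and recycles the shorter vector down each column, so a vector with one entry per row lines up with the rows. A small check of that claim, using made-up names (`m`, `med`, `md`) rather than the ones in the answer:

```r
set.seed(1)
m <- matrix(rnorm(200), nrow = 20)   # 20 rows, 10 columns

med <- apply(m, 1, median)           # length 20: one median per row
md  <- apply(m, 1, mad)              # length 20: one mad per row

by_recycling <- (m - med) / md
by_sweep <- sweep(sweep(m, 1, med, "-"), 1, md, "/")

all.equal(by_recycling, by_sweep)    # TRUE

# Caution: the same shortcut does not work for per-column statistics,
# because a length-ncol vector would still be recycled down the rows.
```
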

思念绕指尖 2024-11-17 06:23:06

R, like matlab, is optimised for vector operations. Your for loops are probably the slowest way of achieving this. The median of each row can be calculated using the apply function rather than a for loop. This will give you a vector of medians, one per row, e.g.

apply(edata[,-1],1,median)  # drop the character NAME column first

Similar approaches can be used for the other measures. Remember, avoiding for loops in R/matlab will generally speed up your code.
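
The same pattern extends to the sorting step in the question. A minimal sketch, assuming that "sort according to the 75% quantiles" means ordering the rows by each row's 75% quantile of the scaled values; the toy data frame, its NAME values and the helper names (`scaled`, `q75`, `ord`, `sorted`) are made up for illustration:

```r
set.seed(47)
# Toy stand-in for edata: a NAME column plus 8 numeric columns
edata <- data.frame(NAME = paste0("gene", 1:6),
                    matrix(rnorm(6 * 8), nrow = 6))

data <- as.matrix(edata[, -1])                 # numeric part only
rownames(data) <- as.character(edata$NAME)     # keep the names attached to rows

# Row-wise (x - median) / mad, transposed back to rows x columns
scaled <- t(apply(data, 1, function(x) (x - median(x)) / mad(x)))

q75 <- apply(scaled, 1, quantile, probs = 0.75)   # one quantile per row
ord <- order(q75, decreasing = TRUE)              # largest quantile first

sorted <- data.frame(NAME = rownames(scaled)[ord],
                     scaled[ord, ],
                     q75 = q75[ord])
```

If an ascending order is wanted instead, dropping `decreasing = TRUE` is enough.
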

残龙傲雪 2024-11-17 06:23:06

There are special functions for dealing with row data, but I like to use apply. You can think of apply as a for loop (which it essentially is) working on one row at a time.

my.m <- matrix(runif(100), ncol = 5)
my.median <- apply(X = my.m, MARGIN = 1, FUN = median) #1 row medians
my.centered <- my.m - my.median #2 subtract each row's median
my.mad <- apply(X = my.m, MARGIN = 1, FUN = mad) #3 row mads
my.centered/my.mad #4 divide the centered values by each row's mad
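
To make the "apply is essentially a for loop" point concrete, here is a small sketch comparing the two on a random matrix; the names `m`, `loop_medians` and `apply_medians` are illustrative only:

```r
set.seed(1)
m <- matrix(runif(100), ncol = 5)        # 20 rows, 5 columns

# Explicit loop: one median per row, stored in a preallocated vector
loop_medians <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  loop_medians[i] <- median(m[i, ])
}

# The same computation expressed with apply()
apply_medians <- apply(m, 1, median)

all.equal(loop_medians, apply_medians)   # TRUE
```

Both versions visit the rows one by one; apply() mainly saves the bookkeeping of preallocating and indexing the result.
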