Finding the row medians and median absolute deviations of a matrix
I have a data frame with 22239 rows and 200 columns. The first column, NAME, is character and the other columns are numeric. My goal is to operate on all elements of each row by:
- Finding the row's median;
- Subtracting the median from each row element (value);
- Finding the row's median absolute deviation (mad);
- Dividing the row's elements by the row's mad.
I tried it this way:
edata <- read.delim("a.txt", header=TRUE, sep="\t")
## Convert the data frame into a matrix:
## all rows, numeric columns 2 to 200
data <- as.matrix(edata[,2:200])
for(i in 1:nrow(data)){ # loop over rows
  m <- median(data[i,]) # median of the row, taken before the row is modified
  md <- mad(data[i,]) # mad of the row
  for(j in 1:ncol(data)) { # loop over the numeric columns
    a <- data[i,j] # matrix element value
    subs <- a-m # subtract the median
    escore <- subs/md # final score
    data[i,j] <- escore # assign the final score back to the row element
  }
}
After getting new values for every element of the rows, I want to sort them according to the 75% quantiles on the basis of the NAME column, but I am not sure how to do this.
I know my code isn't memory efficient. When I run the above code, the looping is very slow. I tried foreach, but couldn't get it to work. Can you suggest a good way to deal with this kind of problem?
Comments (5)
You can put all the steps in one function and use a single apply loop.
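A minimal sketch of that idea, assuming a small random matrix `mat` as a stand-in for the real data; the names `mat`, `row_score` and `scores` are made up for illustration and do not come from the original post:

```r
# Hypothetical stand-in for the real data: 5 rows, 4 numeric columns.
set.seed(1)
mat <- matrix(rnorm(20), nrow = 5,
              dimnames = list(paste0("gene", 1:5), NULL))

# All steps for one row: centre on the median, then divide by the MAD.
row_score <- function(x) {
  (x - median(x)) / mad(x)
}

# apply() over rows returns one column per input row, so transpose back
# to get a matrix with the same shape as mat.
scores <- t(apply(mat, 1, row_score))
```

The transpose is needed because apply() stacks the per-row results as columns.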
This is an ideal job for sweep(). It can be sped up a bit by not using mad(), as it computes the medians again. Notice that R's mad() multiplies by a constant 1.4826 to achieve asymptotically normal consistency, hence the extra bit in the second example.
Some timings on my system:
For @Nick's answer I get:
which is consistently faster than my first version, but a little slower than the second version, again because the medians are being computed twice.
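The two sweep() variants described in this answer might look roughly like the sketch below. It assumes a small random matrix `mat`; all variable names are illustrative, and this is not necessarily the code the answer originally contained:

```r
set.seed(42)
mat <- matrix(rnorm(30), nrow = 6)   # hypothetical 6 x 5 example

# Variant 1: sweep() with median() and mad(); mad() recomputes the medians.
med <- apply(mat, 1, median)
v1 <- sweep(mat, 1, med, "-")
v1 <- sweep(v1, 1, apply(mat, 1, mad), "/")

# Variant 2: reuse the centred values. The median of their absolute values,
# times 1.4826, matches the default scaling used by mad().
centred <- sweep(mat, 1, med, "-")
mads <- 1.4826 * apply(abs(centred), 1, median)
v2 <- sweep(centred, 1, mads, "/")

# The two variants should agree up to floating-point error.
all.equal(v1, v2)
```

The second variant avoids a second pass over the medians, which is the saving the answer refers to.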
How about this:
(I created another matrix to start from, but the method is the same)
I'm sure there are still better ways, but HTH.
R, like MATLAB, is optimised for vector operations. Your for loops are probably the slowest way of achieving this. The median of each row can be calculated with the apply function rather than a for loop. This gives you a vector of medians, one per row, e.g. apply(data, 1, median).
Similar approaches can be used for the other measures. Remember, avoiding for loops in R/MATLAB will generally speed up your code.
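As a small worked example, assuming a hand-made 2 x 3 matrix `mat` (the names here are illustrative only):

```r
mat <- matrix(c(1, 5, 3,
                2, 8, 4), nrow = 2, byrow = TRUE)

# MARGIN = 1 tells apply() to work over rows.
row_medians <- apply(mat, 1, median)

# The same pattern works for the other per-row measure.
row_mads <- apply(mat, 1, mad)
```

Each result has one entry per row of `mat`.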
There are special functions for dealing with row data, but I like to use apply. You can think of apply as a for loop (which it essentially is) working on one row at a time.
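To illustrate the comparison, here is a sketch that computes row medians both ways, assuming a small random matrix `mat`; the variable names are made up for this example:

```r
set.seed(7)
mat <- matrix(rnorm(12), nrow = 3)   # hypothetical 3 x 4 example

# Explicit loop: one row at a time.
loop_med <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  loop_med[i] <- median(mat[i, ])
}

# apply() does the same row-by-row walk in a single call.
apply_med <- apply(mat, 1, median)

# Both approaches should give the same values.
all.equal(loop_med, apply_med)
```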