Efficiently average the second column over intervals defined by the first column
There are two numeric columns in a data file. I need to calculate the average of the second column over intervals (such as 100) of the first column.
I can program this task in R, but my R code is really slow on a relatively large data file (millions of rows, with first-column values ranging from 1 to 33132539).
Here I show my R code. How could I tune it to be faster? Solutions based on Perl, Python, awk, or the shell are also appreciated.
Thanks in advance.
(1) my data file (tab-delimited, millions of rows)
5380 30.07383
5390 30.87
5393 0.07383
5404 6
5428 30.07383
5437 1
5440 9
5443 30.07383
5459 6
5463 30.07383
5480 7
5521 30.07383
5538 0
5584 20
5673 30.07383
5720 30.07383
5841 3
5880 30.07383
5913 4
5958 30.07383
(2) what I want to get (here interval = 100)
interval_of_first_column, average_of_2nd_column_within_interval
100, 0
200, 0
300, 20.34074
400, 14.90325
...
(3) R code
chr1 <- 33132539                                  # upper limit of the first column
window <- 100                                     # size of each interval
spe <- read.table("my_data_file", header=FALSE)   # read the data in
names(spe) <- c("pos", "rho")                     # name the columns
interval.chr1 <- data.frame(pos=seq(0, chr1, window))  # interval breakpoints
meanrho.chr1 <- numeric(nrow(interval.chr1) - 1)  # preallocate the result
# the actual calculation -- really slow on the full data
for(i in 1:(nrow(interval.chr1) - 1)){
  # half-open interval [pos[i], pos[i+1]) so no value is counted twice
  count.sub <- subset(spe, pos >= interval.chr1$pos[i] & pos < interval.chr1$pos[i+1])
  meanrho.chr1[i] <- mean(count.sub$rho)
}
Comments (7)
You don't really need to set up an output data.frame, but you can if you want. Here is how I would have coded it, and I guarantee it will be fast.
You could have done even less setup (skipping the incrmt variable) with this code:
And if you want the result to be available for something:
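The R snippets this answer refers to were not preserved in this copy. As a rough stand-in (not the answerer's code), the grouped-mean idea it describes can be sketched in Python: bin each row by integer division of the first column, then average within each bin.

```python
from collections import defaultdict

def binned_means(rows, window=100):
    """Average the second value over windows of the first value.

    rows: iterable of (pos, rho) pairs; window: interval width.
    Returns {upper_bound_of_interval: mean_of_rho} for non-empty bins only.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, rho in rows:
        # interval (k*window, (k+1)*window] is labelled by its upper bound
        bin_upper = ((pos - 1) // window + 1) * window
        sums[bin_upper] += rho
        counts[bin_upper] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

data = [(5380, 30.07383), (5390, 30.87), (5393, 0.07383), (5404, 6.0)]
print(binned_means(data))
```

Unlike the expected output in the question, this sketch omits empty intervals entirely rather than reporting them as 0.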
Given the size of your problem, you need to use data.table, which is lightning fast. This took 20 seconds on my Macbook Pro (2.53 GHz, 4 GB RAM). If you don't have any NA in your second column, you can obtain a 10x speedup by replacing mean with .Internal(mean). Here is the speed comparison using rbenchmark and 5 replications. Note that data.table with .Internal(mean) is 10x faster.

Update from Matthew: New in v1.8.2, this optimization (replacing mean with .Internal(mean)) is now made automatically; i.e., a regular DT[,mean(somecol),by=] now runs at the 10x faster speed. We'll try to make more convenience changes like this in the future, so that users don't need to know as many tricks to get the best from data.table.
Based on your code, I would guess that this would work on the full data set (depending on your system's memory):

I think you want a factor that defines the groups of 100-wide intervals of the first column (pos); then you can use the standard apply family of functions to get means within the groups. Here is the data you posted, in reproducible form.

Define the intervals with cut; we just want every 100th value (but you might want the details tweaked as per your code for your real data set). Now pass the desired function (mean) over each group. (There are lots of NAs since we didn't start at 0.) (Add other arguments to FUN, such as na.rm, as necessary, e.g.:)

See ?tapply for applying over the groups in a vector (a ragged array), and ?cut for ways to generate grouping factors.
Here is a Perl program that does what I think you want. It assumes the rows are sorted by the first column.
The first thing that comes to mind is a Python generator, which is memory-efficient.
Then put some logic in another function (supposing that you save the results in a file).
EDIT: The above solution assumed that the numbers in the first column are ALL the numbers from 1 to N. Since your case does not follow this pattern (per the extra details in the comments), here is the corrected function:
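The code for this answer is likewise missing from this copy. A sketch of what such a generator pipeline might look like is below; the function names are hypothetical, not the answerer's, but it follows the described shape: a generator that yields parsed rows lazily, and a consumer that writes one average per interval, handling sparse first-column values (the case the EDIT addresses) rather than assuming every value 1..N is present.

```python
import io

def read_rows(fh):
    """Lazily yield (pos, rho) pairs from a whitespace-delimited stream."""
    for line in fh:
        pos_s, rho_s = line.split()
        yield int(pos_s), float(rho_s)

def write_interval_means(rows, out, window=100):
    """Consume (pos, rho) pairs sorted by pos and write one 'bound, mean'
    line per non-empty interval; positions may be sparse."""
    current, total, count = None, 0.0, 0
    for pos, rho in rows:
        upper = ((pos - 1) // window + 1) * window   # interval upper bound
        if current is not None and upper != current:
            out.write(f"{current}, {total / count}\n")
            total, count = 0.0, 0
        current, total, count = upper, total + rho, count + 1
    if count:                                        # flush the last interval
        out.write(f"{current}, {total / count}\n")

# demo on an in-memory "file"; with real data, pass open(path) instead
src = io.StringIO("5380\t30.07383\n5390\t30.87\n5501\t4\n")
dst = io.StringIO()
write_interval_means(read_rows(src), dst)
```

Since both stages are lazy, only one row and one running sum are in memory at a time, which is what makes the generator approach suitable for millions of rows.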
A one-liner in Perl is simple and efficient, as usual: