Efficiently averaging the second column over intervals defined by the first column

Posted 2024-12-06 13:07:30


A data file has two numeric columns. I need to average the second column over fixed-width intervals (for example, 100) of the first column.

I can program this task in R, but my R code is really slow for a relatively large data file (millions of rows, with first-column values ranging from 1 to 33132539).

Here is my R code. How could I tune it to be faster? Solutions based on Perl, Python, awk, or shell are also appreciated.

Thanks in advance.

(1) my data file (tab-delimited, millions of rows)

5380    30.07383
5390    30.87
5393    0.07383
5404    6
5428    30.07383
5437    1
5440    9
5443    30.07383
5459    6
5463    30.07383
5480    7
5521    30.07383
5538    0
5584    20
5673    30.07383
5720    30.07383
5841    3
5880    30.07383
5913    4
5958    30.07383

(2) what I want to get, here interval = 100

intervals_of_first_columns, average_of_2nd column_by_the_interval
100, 0
200, 0
300, 20.34074
400, 14.90325
.....

(3) R code

chr1 <- 33132539 # set the limit for the interval
window <- 100 # set the size of interval

spe <- read.table("my_data_file", header=F) # read my data in
names(spe) <- c("pos", "rho") # name my data 

interval.chr1 <- data.frame(pos=seq(0, chr1, window)) # setup intervals
meanrho.chr1 <- NULL # object for the mean I want to get

# real calculation, really slow on my own data.
for(i in 1:nrow(interval.chr1)){
  count.sub<-subset(spe, chrom==1 & pos>=interval.chr1$pos[i] & pos<=interval.chr1$pos[i+1])
  meanrho.chr1[i]<-mean(count.sub$rho)
}


Comments (7)

栀子花开つ 2024-12-13 13:07:30

You don't really need to set up an output data.frame but you can if you want. Here is how I would have coded it, and I guarantee it will be fast.

> dat$incrmt <- dat$V1 %/% 100
> dat
     V1       V2 incrmt
1  5380 30.07383     53
2  5390 30.87000     53
3  5393  0.07383     53
4  5404  6.00000     54
5  5428 30.07383     54
6  5437  1.00000     54
7  5440  9.00000     54
8  5443 30.07383     54
9  5459  6.00000     54
10 5463 30.07383     54
11 5480  7.00000     54
12 5521 30.07383     55
13 5538  0.00000     55
14 5584 20.00000     55
15 5673 30.07383     56
16 5720 30.07383     57
17 5841  3.00000     58
18 5880 30.07383     58
19 5913  4.00000     59
20 5958 30.07383     59

> with(dat, tapply(V2, incrmt, mean, na.rm=TRUE))
      53       54       55       56       57       58       59 
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692 

You could have done even less setup (skip the incrmt variable) with this code:

    > with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
      53       54       55       56       57       58       59 
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692 

And if you want the result to be available for something:

by100MeanV2 <- with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
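
For readers who want the same grouping outside R, here is a minimal Python sketch of the idea above: integer-divide the first column by the interval width and average the second column per key. The function name `mean_by_bin` and the in-memory `sample` list are illustrative assumptions, not part of the answer; a real run would stream (position, value) pairs from the file instead.

```python
from collections import defaultdict

def mean_by_bin(rows, width=100):
    """Average the value of each (pos, value) pair per pos // width bin."""
    totals = defaultdict(float)   # running sum per bin
    counts = defaultdict(int)     # number of rows per bin
    for pos, value in rows:
        key = pos // width        # same grouping key as V1 %/% 100 in R
        totals[key] += value
        counts[key] += 1
    # return the bins in increasing order, like the tapply output
    return {key: totals[key] / counts[key] for key in sorted(totals)}

# first four rows of the sample data from the question
sample = [(5380, 30.07383), (5390, 30.87), (5393, 0.07383), (5404, 6.0)]
means = mean_by_bin(sample)
print(means)
```

Like the tapply call, this only reports bins that contain at least one row; empty intervals simply do not appear in the result.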
只有影子陪我不离不弃 2024-12-13 13:07:30
use strict;
use warnings;

my $BIN_SIZE = 100;
my %freq;

while (<>){
    my ($k, $v) = split;
    my $bin = $BIN_SIZE * int($k / $BIN_SIZE);
    $freq{$bin}{n} ++;
    $freq{$bin}{sum} += $v;
}

for my $bin (sort { $a <=> $b  } keys %freq){
    my ($n, $sum) = map $freq{$bin}{$_}, qw(n sum);
    print join("\t", $bin, $n, $sum, $sum / $n), "\n";
}
歌入人心 2024-12-13 13:07:30

Given the size of your problem, you need to use data.table, which is lightning fast.

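require(data.table)
N = 10^6; M = 33132539   # number of simulated rows; largest first-column value
mydt = data.table(V1 = runif(N, 1, M), V2 = rpois(N, lambda = 10))
# group by the 100-wide bin of V1 and average V2 within each bin
ans  = mydt[, list(avg_V2 = mean(V2)), by = list(bin = V1 %/% 100)]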

This took 20 seconds on my MacBook Pro (2.53 GHz, 4 GB RAM). If you don't have any NA in your second column, you can obtain a 10x speedup by replacing mean with .Internal(mean).

Here is the speed comparison using rbenchmark and 5 replications. Note that data.table with .Internal(mean) is 10x faster.

test        replications   elapsed   relative 
f_dt()            5         113.752   10.30736   
f_tapply()        5         147.664   13.38021   
f_dt_internal()   5          11.036    1.00000  

Update from Matthew:

New in v1.8.2, this optimization (replacing mean with .Internal(mean)) is now automatically made; i.e., regular DT[,mean(somecol),by=] now runs at the 10x faster speed. We'll try and make more convenience changes like this in future, so that users don't need to know as many tricks in order to get the best from data.table.

苦行僧 2024-12-13 13:07:30

Based on your code, I would guess that this would work on the full data set (depending on your system's memory):

chr1 <- 33132539 
window <- 100 

# bin the observed positions; the extra break covers positions past the last multiple of window
pos <- cut(spe$pos, seq(0, chr1 + window, window))

meanrho.chr1 <- tapply(spe$rho, INDEX = pos, FUN = mean)

I think you want a factor that defines groups of intervals for every 100 within the first column (pos), and then you can use the standard apply family of functions to get means within groups.

Here is the data you posted in reproducible form.

spe <- structure(list(pos = c(5380L, 5390L, 5393L, 5404L, 5428L, 5437L, 
5440L, 5443L, 5459L, 5463L, 5480L, 5521L, 5538L, 5584L, 5673L, 
5720L, 5841L, 5880L, 5913L, 5958L), rho = c(30.07383, 30.87, 0.07383, 
6, 30.07383, 1, 9, 30.07383, 6, 30.07383, 7, 30.07383, 0, 20, 
30.07383, 30.07383, 3, 30.07383, 4, 30.07383)), .Names = c("pos", 
"rho"), row.names = c(NA, -20L), class = "data.frame")

Define the intervals with cut; we just want a break at every 100th value (but you might want the details tweaked as per your code for your real data set).

pos.index <- cut(spe$pos, seq(0, max(spe$pos), by = 100))

Now pass the desired function (mean) over each group.

tapply(spe$rho, INDEX = pos.index, FUN = mean)

(Lots of NAs, since the data don't start at 0; the end of the output looks like this:)

(5.2e+03,5.3e+03] (5.3e+03,5.4e+03] (5.4e+03,5.5e+03] (5.5e+03,5.6e+03] (5.6e+03,5.7e+03] (5.7e+03,5.8e+03] (5.8e+03,5.9e+03] 
               NA          20.33922          14.90269          16.69128          30.07383          30.07383          16.53692 

(Pass other arguments on to FUN, such as na.rm, as necessary, e.g.:)

## tapply(spe$rho, INDEX = pos.index, FUN = mean, na.rm = TRUE)

See ?tapply for applying a function over groups in a vector (a ragged array), and ?cut for ways to generate grouping factors.
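
One detail worth checking when comparing this with the %/% approach from the other answers: by default cut builds right-closed intervals such as (5300, 5400], while integer division builds left-closed ones such as [5400, 5500). The Python sketch below only illustrates that boundary behaviour; the helper names `floor_bin` and `cut_bin` are made up for the example, and `cut_bin` assumes positive integer positions.

```python
def floor_bin(pos, width=100):
    """Index of the left-closed bin [k*width, (k+1)*width), as with pos %/% width."""
    return pos // width

def cut_bin(pos, width=100):
    """Index of the right-closed bin (k*width, (k+1)*width], the default of R's cut().

    Assumes pos is a positive integer; 0 falls outside the first interval,
    which matches cut() returning NA for it when the breaks start at 0.
    """
    return (pos - 1) // width

# positions just below, on, and just above an interval boundary
for pos in (5399, 5400, 5401):
    print(pos, floor_bin(pos), cut_bin(pos))
```

A position that is an exact multiple of the width, such as 5400, lands in bin 53 under the cut convention and in bin 54 under floor division, so the two approaches can give slightly different means if the data contain such positions.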

郁金香雨 2024-12-13 13:07:30

Here is a Perl program that does what I think you want. It assumes the rows are sorted by the first column.

#!/usr/bin/perl
use strict;
use warnings;

my $input_name       = "t.dat";
my $output_name      = "t_out.dat";
my $initial_interval = 1;

my $interval_size    = 100;
my $start_interval   = $initial_interval;
my $end_interval     = $start_interval + $interval_size;

my $interval_total   = 0;
my $interval_count   = 0;

open my $DATA, "<", $input_name  or die "$input_name: $!";
open my $AVGS, ">", $output_name or die "$output_name: $!";

my $rows_in  = 0;
my $rows_out = 0;
$| = 1;

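while (<$DATA>) {    # read one line at a time; for (<$DATA>) would load the whole file first
    chomp;           # strip the trailing newline from the value field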
    $rows_in++;

    # progress indicator, nice for big data
    print "*" unless $rows_in % 1000;
    print "\n" unless $rows_in % 50000;

    my ($key, $value) = split /\t/;

    # handle possible missing intervals
    while ($key >= $end_interval) {

        # put your value for an empty interval here...
        my $interval_avg = "empty";

        if ($interval_count) {
            $interval_avg = $interval_total/$interval_count;
        }
        print $AVGS $start_interval,"\t", $interval_avg, "\n";
        $rows_out++;

        $interval_count = 0;
        $interval_total = 0;

        $start_interval = $end_interval;
        $end_interval   += $interval_size;
    }

    $interval_count++;
    $interval_total += $value;
}

# handle the last interval
if ($interval_count) {
    my $interval_avg = $interval_total/$interval_count;
    print $AVGS $start_interval,"\t", $interval_avg, "\n";
    $rows_out++;
}

print "\n";
print "Rows in:  $rows_in\n";
print "Rows out: $rows_out\n";

exit 0;
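
The same "rows are sorted by the first column" assumption can also be exploited in Python with itertools.groupby, which groups consecutive rows sharing a key. This is only a sketch under that assumption: the function name `sorted_bin_means` and the sample rows are invented for illustration, and unlike the Perl program it uses bins starting at 0 (0-99, 100-199, ...) and skips empty intervals rather than printing a placeholder.

```python
from itertools import groupby

def sorted_bin_means(rows, width=100):
    """Yield (bin_start, mean) for (pos, value) rows already sorted by pos."""
    # groupby only merges *consecutive* rows with the same key,
    # which is why the input must be sorted by position
    for key, group in groupby(rows, key=lambda row: row[0] // width):
        values = [value for _, value in group]
        yield key * width, sum(values) / len(values)

# small made-up sample, sorted by position
sample = [(5380, 30.0), (5390, 32.0), (5404, 6.0), (5673, 1.0)]
result = list(sorted_bin_means(sample))
print(result)
```

Because only one interval's values are held at a time, memory use stays small even for millions of rows, provided `rows` is itself a generator reading the file line by line.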
一个人的旅程 2024-12-13 13:07:30

The first thing that comes to mind is a Python generator, which is memory efficient.

def cat(data_file): # cat generator: yields the file one line at a time
    with open(data_file, "r") as f:  # the file is closed when the generator finishes
        for line in f:
            yield line

Then put some logic in another function (supposing that you save the results in a file):

def foo(data_file, output_file):
    f = open(output_file, "w")
    cnt = 0
    suma = 0.0
    for line in cat(data_file):
        suma += float(line.split()[-1])  # convert the text field before summing
        cnt += 1
        if cnt % 100 == 0:  # every 100 rows, write the running mean and reset
            f.write("%s\t%s\n" % (cnt, suma / 100.0))
            suma = 0.0
    f.close()

EDIT: The above solution assumed that the first column contains every number from 1 to N. As your case does not follow this pattern (from the extra details in the comments), here is the correct function:

def foo_for_your_case(data_file, output_file):
    f = open(output_file, "w")
    interval = 100
    suma = 0.0
    cnt = 0 # keep track of number of elements in the interval

    for line in cat(data_file):
        spl = line.split()

        # close every interval that ends before this row's position
        while int(spl[0]) > interval:
            if cnt > 0: f.write("%s\t%s\n" % (interval, suma / cnt))
            else: f.write("%s\t0\n" % interval)
            interval += 100
            suma = 0.0
            cnt = 0

        suma += float(spl[-1])
        cnt += 1

    # write the last interval, which the loop above never closes
    if cnt > 0:
        f.write("%s\t%s\n" % (interval, suma / cnt))
    f.close()
赠我空喜 2024-12-13 13:07:30

A Perl one-liner is simple and efficient, as usual:

perl -F\\t -lane'BEGIN{$l=33132539;$i=100;$,=", "}sub p(){print$r*$i,$s/$n if$n;$r=int($F[0]/$i);$s=$n=0}last if$F[0]>$l;p if int($F[0]/$i)!=$r;$s+=$F[1];$n++}{p'