如何使用GSL实现Pearson相关系数?

发布于 2024-12-28 17:07:13 字数 1812 浏览 2 评论 0原文

我有两个浮点向量 x 和 y,我想计算 Pearson 相关性系数。由于我必须对大量数据(例如 1000 万个不同的向量 x 和 20000 个不同的向量 y)执行此操作,因此我使用 C++,更具体地说是 gsl_stats_correlation 函数。

这是我的 C++ 代码:

#include <iostream>
#include <vector>
using namespace std;

#include <gsl/gsl_vector.h>
#include <gsl/gsl_statistics.h>

int main (int argc, char ** argv)
{
  vector<double> x, y;
  size_t n = 5;
  x.push_back(1.0); y.push_back(1.0);
  x.push_back(3.1); y.push_back(3.2);
  x.push_back(2.0); y.push_back(1.9);
  x.push_back(5.0); y.push_back(4.9);
  x.push_back(2.0); y.push_back(2.1);
  for(size_t i=0; i<n; ++i)
      printf ("x[%ld]=%.1f y[%ld]=%.1f\n", i, x[i], i, y[i]);

  gsl_vector_const_view gsl_x = gsl_vector_const_view_array( &x[0], x.size() );
  gsl_vector_const_view gsl_y = gsl_vector_const_view_array( &y[0], y.size() );

  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, sizeof(double),
                                          (double*) gsl_y.vector.data, sizeof(double),
                                          n );
  printf ("Pearson correlation = %f\n", pearson);

  return 0;
}

它编译成功(gcc -Wall -g pearson.cpp -lstdc++ -lgsl -lgslcblas -o pearson),但是当我在这里运行它时,输出是:

x[0]=1.0 y[0]=1.0
x[1]=3.1 y[1]=3.2
x[2]=2.0 y[2]=1.9
x[3]=5.0 y[3]=4.9
x[4]=2.0 y[4]=2.1
Pearson correlation = 1.000000

虽然显然结果不应该恰好是 1,如图所示使用以下 R 代码:

x <- c(1.0,3.1,2.0,5.0,2.0); y <-c(1.0,3.2,1.9,4.9,2.1)
cor(x, y, method="pearson")  # 0.99798

我缺少什么?

I have two vectors of floats, x and y, and I want to compute the Pearson correlation coefficients. As I have to do it on a lot of data (for instance 10 millions different vectors x and 20 thousand different vectors y), I am using C++, and more specifically the gsl_stats_correlation function from the GSL.

Here is my C++ code:

#include <iostream>
#include <vector>
using namespace std;

#include <gsl/gsl_vector.h>
#include <gsl/gsl_statistics.h>

int main (int argc, char ** argv)
{
  vector<double> x, y;
  size_t n = 5;
  x.push_back(1.0); y.push_back(1.0);
  x.push_back(3.1); y.push_back(3.2);
  x.push_back(2.0); y.push_back(1.9);
  x.push_back(5.0); y.push_back(4.9);
  x.push_back(2.0); y.push_back(2.1);
  for(size_t i=0; i<n; ++i)
      printf ("x[%ld]=%.1f y[%ld]=%.1f\n", i, x[i], i, y[i]);

  gsl_vector_const_view gsl_x = gsl_vector_const_view_array( &x[0], x.size() );
  gsl_vector_const_view gsl_y = gsl_vector_const_view_array( &y[0], y.size() );

  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, sizeof(double),
                                          (double*) gsl_y.vector.data, sizeof(double),
                                          n );
  printf ("Pearson correlation = %f\n", pearson);

  return 0;
}

It compiles successfully (gcc -Wall -g pearson.cpp -lstdc++ -lgsl -lgslcblas -o pearson) but when I run it here is the output:

x[0]=1.0 y[0]=1.0
x[1]=3.1 y[1]=3.2
x[2]=2.0 y[2]=1.9
x[3]=5.0 y[3]=4.9
x[4]=2.0 y[4]=2.1
Pearson correlation = 1.000000

While obviously the results should not be exactly 1, as shown with the following R code:

x <- c(1.0,3.1,2.0,5.0,2.0); y <-c(1.0,3.2,1.9,4.9,2.1)
cor(x, y, method="pearson")  # 0.99798

What am I missing?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

和我恋爱吧 2025-01-04 17:07:13

将行:更改

  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, sizeof(double),
                                          (double*) gsl_y.vector.data, sizeof(double),
                                          n );

为:

  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, 1,
                                          (double*) gsl_y.vector.data, 1,
                                          n );

或者,如果您想避免重复“幻数”1:

  const size_t stride = 1;
  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, stride,
                                          (double*) gsl_y.vector.data, stride,
                                          n );

gsl_stats_correlation 假设 double,第二个和第四个参数是“跨步”的​​双精度数,因此通过给出sizeof(double) 它跳跃了 sizeof(double)*sizeof(double) 字节。

Change the line:

  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, sizeof(double),
                                          (double*) gsl_y.vector.data, sizeof(double),
                                          n );

to:

  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, 1,
                                          (double*) gsl_y.vector.data, 1,
                                          n );

or, if you want to avoid repeating the "magic number" 1:

  const size_t stride = 1;
  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, stride,
                                          (double*) gsl_y.vector.data, stride,
                                          n );

gsl_stats_correlation assumes double, the second and fourth argument are the number of doubles to "stride", so by giving it sizeof(double) it was jumping by sizeof(double)*sizeof(double) bytes.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文