Hoeffding 的“D”的 GPU 实现想法（相关）系数？

发布于 2025-01-05 05:09:16 字数 1260 浏览 3 评论 0原文

我正在尝试提出一种非常快速的算法来计算这个非常有趣的统计数据，该统计数据充分利用了强大 GPU 的功能。理想情况下，我将使用 Jacket 在 Matlab 中执行此操作，但 CUDA 代码或 OpenCL 中的其他想法也将非常感激。基本上，我想众包许多聪明的想法，我可以尝试将它们组合在一起，然后我将尝试将结果开源，以便其他人可以使用它。

尽管这种依赖系数很强大（它甚至能够检测“一对多”的依赖关系），但网上几乎没有任何关于它的信息，除了两个来源：SAS统计软件和Frank的优秀R包Hmisc哈勒尔.您可以在此处阅读该算法的描述：

http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm

这是 Harrell 的 Fortran 代码（如果您已经了解计算，那么非常容易理解）：

http://hmisc.sourcearchive.com/documentation/3.7-0/hoeffd_8f-source .html

（此外，Hmisc pdf 文档的第 128 页有更多详细信息。）

这是一个计算要求非常高的算法 - 如果您想将其应用于数据如果设置了数千行和几千列，即使使用快速的 Fortran 实现，您也会等待很多天才能得到结果——即使是在一台新机器上。我希望使用 Nvidia GTX 580 级别的显卡，或者更好的是 Tesla，可以将时间缩短到几个小时。在这种情况下，无论是识别基因还是在实验数据中寻找相关性，这种组合都将成为一股不可忽视的分析力量。

不管怎样，我期待着人们的回应，并希望我们能够为 Hoeffding's D 打造一个基于 GPU 的快速算法成为现实。

预先感谢您提供任何想法 - 请随时提供部分或半生不熟的想法！

更新：因此，Jascha 慷慨地提供了 Hoeffding's D 在 Matlab 中的有效实现，这实现了我的目标之一。另一种是通过使用 GPU（最好是使用 Jacket）来从根本上提高速度。有没有人看到在 GPU 上以巧妙的方式做到这一点的路径或策略？

原文

I am trying to come up with a very fast algorithm for calculating this very interesting statistic that takes full advantage of the capabilities of a powerful GPU. Ideally I will do this in Matlab using Jacket, but other ideas in CUDA code or OpenCL would also be much appreciated. Basically, I want to crowd-source lots of clever ideas that I can try to put together, and then I will attempt to open-source the result so others can use it.

Despite the power of this dependency coefficient (it is capable of detecting even "one-to-many" dependency relationships), there is almost nothing about it online, except for two sources: SAS statistical software, and the excellent R package Hmisc by Frank Harrell. You can read a description of the algorithm here:

http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm

And here is Harrell's code in Fortran (which is surprisingly easy to follow if you understand the calculation already):

http://hmisc.sourcearchive.com/documentation/3.7-0/hoeffd_8f-source.html

(also, page 128 of the pdf documentation for Hmisc has some more details.)

This is a very computationally demanding algorithm-- if you wanted to apply it to a data set that is thousands of rows and a few thousand columns, even with the fast Fortran implementation, you would be waiting many days for the result-- even on a new machine. My hope is that using an Nvidia GTX 580 level card, or better yet, a Tesla, would bring that down to a couple hours. In which case, the combination would be an analytical force to be reckoned with, whether identifying genes or finding correlations in experimental data.

Anyway, I look forward to peoples' responses and hope we can make a fast, GPU based algorithm for Hoeffding's D a reality.

Thanks in advance for any ideas-- and please don't hesitate to give partial or half-baked ideas!

Update: So, Jascha has generously provided a working implementation of Hoeffding's D in Matlab, which fulfilled one of my objectives. The other was to radically enhance the speed by using a GPU, preferably using Jacket. Does anyone see a path or strategy to do this in a clever way on the GPU?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

幸福还没到 2025-01-12 05:09:16

下面是一个代码示例，其中包含 MATLAB 中 Hoeffding D 依赖性度量的简单实现。这不是 GPU 化的，但对于那些想要计算此统计数据但不使用 Fortran 的人来说可能很有用，或者作为将其放在 GPU 上的起点。（扩展标题说明了几种联合分布类型的 Hoeffding D 值。）

function [ D ] = hoeffdingsD( x, y )
%Compute's Hoeffding's D measure of dependence between x and y
%   inputs x and y are both N x 1 arrays
%   output D is a scalar
%     The formula for Hoeffding's D is taken from
%     http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm
% Below is demonstration code for several types of dependencies.
%
% % this case should be 0 - there are no dependencies
% x = randn(1000,1);
% y = randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y and x independent';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% % the rest of these cases have dependencies of different types
% x = randn(1000,1);
% y = x;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = cos(10*x);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = cos(10x)';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = x.^2;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = x.^2 + randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2 + N( 0, 1 )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = -z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = -z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );

N = size(x,1);

R = tiedrank( x );
S = tiedrank( y );

Q = zeros(N,1);
parfor i = 1:N
    Q(i) = 1 + sum( R < R(i) & S < S(i) );
    % and deal with cases where one or both values are ties, which contribute less
    Q(i) = Q(i) + 1/4 * (sum( R == R(i) & S == S(i) ) - 1); % both indices tie.  -1 because we know point i matches
    Q(i) = Q(i) + 1/2 * sum( R == R(i) & S < S(i) ); % one index ties.
    Q(i) = Q(i) + 1/2 * sum( R < R(i) & S == S(i) ); % one index ties.
end

D1 = sum( (Q-1).*(Q-2) );
D2 = sum( (R-1).*(R-2).*(S-1).*(S-2) );
D3 = sum( (R-2).*(S-2).*(Q-1) );

D = 30*((N-2)*(N-3)*D1 + D2 - 2*(N-2)*D3) / (N*(N-1)*(N-2)*(N-3)*(N-4));

end

Below is a code example with a simple implementation of the Hoeffding's D measure of dependency in MATLAB. This is NOT GPU-ized, but may prove useful to people who want to calculate this statistic and are not using Fortran, or as a starting point for putting it on the GPU. (The extended header illustrates Hoeffding's D values for several types of joint distributions.)

function [ D ] = hoeffdingsD( x, y )
%Compute's Hoeffding's D measure of dependence between x and y
%   inputs x and y are both N x 1 arrays
%   output D is a scalar
%     The formula for Hoeffding's D is taken from
%     http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm
% Below is demonstration code for several types of dependencies.
%
% % this case should be 0 - there are no dependencies
% x = randn(1000,1);
% y = randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y and x independent';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% % the rest of these cases have dependencies of different types
% x = randn(1000,1);
% y = x;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = cos(10*x);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = cos(10x)';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = x.^2;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = x.^2 + randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2 + N( 0, 1 )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = -z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = -z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );

N = size(x,1);

R = tiedrank( x );
S = tiedrank( y );

Q = zeros(N,1);
parfor i = 1:N
    Q(i) = 1 + sum( R < R(i) & S < S(i) );
    % and deal with cases where one or both values are ties, which contribute less
    Q(i) = Q(i) + 1/4 * (sum( R == R(i) & S == S(i) ) - 1); % both indices tie.  -1 because we know point i matches
    Q(i) = Q(i) + 1/2 * sum( R == R(i) & S < S(i) ); % one index ties.
    Q(i) = Q(i) + 1/2 * sum( R < R(i) & S == S(i) ); % one index ties.
end

D1 = sum( (Q-1).*(Q-2) );
D2 = sum( (R-1).*(R-2).*(S-1).*(S-2) );
D3 = sum( (R-2).*(S-2).*(Q-1) );

D = 30*((N-2)*(N-3)*D1 + D2 - 2*(N-2)*D3) / (N*(N-1)*(N-2)*(N-3)*(N-4));

end

回复收藏 0 原文

也只是曾经 2025-01-12 05:09:16

我本想这只是一条评论，但它太长了。如果您不想接受它作为答案或其他任何内容，请不用担心。

首先，值得称赞的是你正在考虑如何将其映射到GPU。更多的科学家应该花时间为他们的算法考虑这样的想法。然而，在阅读描述后，我不相信 GPU 是并行化此特定方法的理想方式。我这么说的原因是，GP-GPU 编程往往在“批处理”方面非常有效，即自然适合按元素进行基本操作（线程级别）的算法。

目前尚不清楚该算法是否沿着这些思路进行有用的划分，除了其他人已经提到的求和/排名计算级别（并且这些类型的子功能在 GPU 上已经得到很好的理解）。但是，如果您缩小范围并开始尝试考虑如何使用 GPU 在列与列的比较级别上加快速度，那么您无能为力。了解特定条目小于给定点值的位置并不能让您避免在更改一列或另一列时执行相同的计算。

简而言之，您将必须在顶层对算法进行 N(N+1)/2 次不同的迭代，并且没有办法避免这种情况。在每个算法中，都有足够的空间将列数组发送到 GPU 并在线程级别进行比较，以快速计算各种排名统计数据。

更好的方法可能是在多处理器设置中使用 MPI，以便将每个高级迭代分配给不同的处理器。如果您没有，主从并行模型（名字很糟糕）似乎很合适足够的处理器和/或如果您希望每个处理器在单个高级迭代中利用 GPU。只要您尝试计算的 Hoeffding“协方差”矩阵的上三角元素仍然存在，您就会不断将任务发送给可用的处理器。

其次，我认为 Matlab 和 Jacket 在这里更多的是时间问题，比你想象的要多。是的，Matlab 针对某些线性代数运算进行了优化，但几乎总是明显比任何“真正的”编程语言慢。代价是您可以通过商业级文档获得许多便利的功能，这有时很有价值。

我建议您的另一种选择是将 PyCUDA 与 mpi4py。使用 NumPy 和 Cython ，Python 严格来说比 Matlab 更好、更快，即使是在便利功能和易用性方面也是如此。如果您使用 PyDev 插件。 org/" rel="nofollow noreferrer">Eclipse IDE，甚至Matlab终端的用户体验也基本相同。另外，您无需为 Jacket 支付许可证或任何额外费用。

您必须做一些工作才能让 PyCUDA 和 mpi4py 协同工作，但最终我认为开源性将使您的努力对更多人来说更有价值，并且代码将可能会快得多。

另外，如果您确实坚持使用 Matlab，您是否对单对列情况下现有的 Fortran 代码（针对数千个条目但仅在两列中）进行了计时？如果速度相当快，您应该能够使用 mex 文件为该 Fortran 编写一个包装器。在这种情况下，Matlab 的简单多处理器会话就可以满足您的需求。

I intended this to be just a comment, but it is too long. No worries if you don't want to accept it as an answer or anything.

First, it is commendable that you are thinking about how to map this to the GPU. Many more scientists should take the time to consider ideas like this for their algorithms. However, after reading the description, I am not convinced that a GPU is the ideal way in which to parallelize this particular method. The reason I say this is that GP-GPU programming tends to be very effective at "batch processing", i.e. algorithms that naturally lend themselves to a element-wise basic manipulation (the thread level).

It isn't clear that there's a useful division of this algorithm along those lines except at the summation / rank-calculation level that others have already mentioned (and these sorts of sub-functions are already well-understood on the GPU). But if you zoom out and start trying to think about how you can use the GPU to speed things up at the level of column-to-column comparisons, there's not much you can do. Knowing the places where particular entries are less than a given point's value won't let you avoid doing that same computation when one or the other of the columns are changed.

In short, you're going to have to do N(N+1)/2 different iterations of the algorithm at the top level, and there's no way to avoid that. Within each of those algorithms, there's plenty of room to send your column arrays to the GPU and do comparisons at the thread level to calculate the various rank statistics quickly.

A better approach might be to use MPI in a multiprocessor setup, so that you farm out each high-level iteration to different processors. The master-slave parallel model (terribly named) seems appropriate if you don't have enough processors and/or if you want each processor to utilize the GPU within a single high-level iteration. For as long as there continues to be upper-triangular elements of the Hoeffding "covariance" matrix you're trying to compute, you keep pumping out tasks to the available processors.

Secondly, I think Matlab and Jacket are more of a time-problem here than you might believe. Yes, Matlab is optimized for some linear algebra operations, but almost always it is unequivocally slower than any 'real' programming language. The tradeoff is that you get a lot of convenience functions with commercial-grade documentation, and this is valuable sometimes.

An alternative that I suggest to you is to use PyCUDA together with mpi4py in the Python programming language. With NumPy and Cython, Python is strictly better and faster than Matlab, even in terms of convenience functions and ease of use. If you use the PyDev plug-in for the Eclipse IDE, even the user experience of the Matlab terminal is basically identical. Plus you don't have to pay for a license or any extra for Jacket.

You'd have to do a bit of work to get the PyCUDA and mpi4py to work together, but in the end I think the open-source-ness would make your effort far more valuable to a greater number of people, and the code would likely be much faster.

Also, if you do indeed stick with Matlab, have you timed the existing Fortran code in the single-pair-of-columns case, for thousands of entries but just in two columns? If that is reasonably fast, you should be able to write a wrapper for that Fortran with a mex file. And in that case, simple multi-processor sessions of Matlab could be all you need.

回复收藏 0 原文

~没有更多了~