CUDA: getting the maximum value in an array and its index

Posted 2024-11-02 11:08:26 · 307 words · 1 view · 0 comments

I have several blocks, where each block works on a separate part of an integer array. For example, block one covers array[0] to array[9] and block two covers array[10] to array[19].

What is the best way to get the index of the max value of the array for each block?

Example: block one, a[0] to a[9], has the following values:
5 10 2 3 4 34 56 3 9 10

So 56 is the largest value at index 6.

I cannot use shared memory because the array may be very big, so it won't fit. Are there any libraries that allow me to do this fast?

I know about the reduction algorithm, but I think my case is different because I want the index of the largest element.

Comments (5)

吾性傲以野 2024-11-09 11:08:26

If I understood correctly, you want the index of the maximum value in an array A.

If so, I would suggest using the Thrust library.

Here is how you could do it:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>

// Return the bigger of two (value, index) tuples.
// Tuples compare lexicographically, so equal values go to the larger index.
template <class T>
struct bigger_tuple {
    __device__ __host__
    thrust::tuple<T,int> operator()(const thrust::tuple<T,int> &a, const thrust::tuple<T,int> &b) const
    {
        if (a > b) return a;
        else return b;
    }

};

template <class T>
int max_index(thrust::device_vector<T>& vec) {

    // create implicit index sequence [0, 1, 2, ... )
    thrust::counting_iterator<int> begin(0);
    thrust::counting_iterator<int> end(static_cast<int>(vec.size()));

    // seed the reduction with the first element (vec must not be empty)
    T first = vec[0];
    thrust::tuple<T,int> init(first, 0);
    thrust::tuple<T,int> largest;

    largest = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(vec.begin(), begin)),
                             thrust::make_zip_iterator(thrust::make_tuple(vec.end(), end)),
                             init, bigger_tuple<T>());
    return thrust::get<1>(largest);
}

int main(){

    thrust::host_vector<int> h_vec(1024);
    thrust::sequence(h_vec.begin(), h_vec.end()); // values = indices

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    int index = max_index(d_vec);

    std::cout << "Max index is: " << index << std::endl;
    std::cout << "Value is: " << h_vec[index] << std::endl;

    return 0;
}
一世旳自豪 2024-11-09 11:08:26

This will not benefit the original poster, but for those who come to this page looking for an answer, I would second the recommendation to use Thrust, which already has a function, thrust::max_element, that does exactly this: it returns an iterator to the largest element, from which the index follows by subtracting the begin iterator. min_element and minmax_element are also provided. See the Thrust documentation for details.
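
For anyone who wants to see the shape of that call, here is a minimal sketch. The helper name argmax_in_range and the [first, last) convention are my own, not part of Thrust; restricting the search to a sub-range is one way to handle the "one segment per block" layout from the question.

```cuda
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// Hypothetical helper (not a Thrust function): global index of the largest
// element in d[first, last). Assumes 0 <= first < last <= d.size().
int argmax_in_range(const thrust::device_vector<int>& d, int first, int last)
{
    // max_element returns an iterator to the first largest element in the range.
    thrust::device_vector<int>::const_iterator it =
        thrust::max_element(d.begin() + first, d.begin() + last);

    // The distance from the start of the vector is the index.
    return static_cast<int>(it - d.begin());
}
```

With the ten values from the question copied into a device_vector, argmax_in_range(d, 0, 10) should give 6. Calling it once per segment launches one reduction per call, so this is probably only reasonable for a handful of large segments.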

一袭水袖舞倾城 2024-11-09 11:08:26

As well as the suggestion to use Thrust, you could also use the CUBLAS cublasIsamax function.
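
A rough sketch of the cuBLAS route, with some caveats: cublasIsamax works on single-precision floats, compares absolute values, and reports a 1-based index, so it only matches the question if the integers are converted to float and are non-negative. The wrapper name argmax_abs_cublas is made up for this example, and error handling is kept to a minimum.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical wrapper: 0-based index of the element with the largest
// absolute value among n host floats, or -1 on failure.
int argmax_abs_cublas(const float* host_values, int n)
{
    if (n <= 0) return -1;

    float* d_values = nullptr;
    if (cudaMalloc(&d_values, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_values, host_values, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        cudaFree(d_values);
        return -1;
    }

    // With the default (host) pointer mode the result lands in a host int.
    int one_based = 0;
    cublasStatus_t status = cublasIsamax(handle, n, d_values, 1, &one_based);

    cublasDestroy(handle);
    cudaFree(d_values);

    // cuBLAS uses Fortran-style 1-based indexing; convert to 0-based.
    return status == CUBLAS_STATUS_SUCCESS ? one_based - 1 : -1;
}
```

Link with -lcublas. In real code you would create the handle once and keep the data on the device rather than copying it for every call.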

赤濁 2024-11-09 11:08:26

The size of your array relative to shared memory is almost irrelevant: the limiting factor is the number of threads in each block, not the size of the array. One solution is to have each thread block work on a chunk of the array the same size as the thread block. That is, with 512 threads per block, block n looks at array[512*n] through array[512*n + 511]. Each block does a reduction to find the largest member of its chunk. Then you bring the max of each chunk (and its index) back to the host and do a simple linear search to locate the highest value in the overall array. Each reduction on the GPU shrinks the linear search by a factor of 512. Depending on the size of the array, you might want to do more reductions before you bring the data back. (If your array is 3*512^10 in size, you might want to do 10 reductions on the GPU and have the host search through the 3 remaining data points.)
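
To make the point about shared memory concrete, here is a sketch of a per-block argmax kernel. It is a variation on the scheme described above rather than a literal transcription: each block owns a segment of seg_len elements (which may be longer than the block), each thread first scans a strided slice, and shared memory only ever holds one (value, index) pair per thread. The names kThreads, beats, segment_argmax_kernel and segment_argmax are mine, the block size of 256 is an assumption, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// True if candidate (v, i) beats the current best (bv, bi):
// the larger value wins, equal values go to the lower index.
__device__ bool beats(int v, int i, int bv, int bi)
{
    return v > bv || (v == bv && i < bi);
}

// One block per segment of seg_len consecutive elements. Shared memory holds
// one (value, index) pair per thread, independent of the segment length.
__global__ void segment_argmax_kernel(const int* data, int n, int seg_len, int* out_idx)
{
    __shared__ int s_val[kThreads];
    __shared__ int s_idx[kThreads];

    int seg_begin = blockIdx.x * seg_len;
    int seg_end = min(seg_begin + seg_len, n);

    // Step 1: each thread scans a strided slice of the segment.
    int best_val = INT_MIN;
    int best_idx = INT_MAX;  // sentinel meaning "nothing seen yet"
    for (int i = seg_begin + threadIdx.x; i < seg_end; i += blockDim.x) {
        if (beats(data[i], i, best_val, best_idx)) {
            best_val = data[i];
            best_idx = i;
        }
    }
    s_val[threadIdx.x] = best_val;
    s_idx[threadIdx.x] = best_idx;
    __syncthreads();

    // Step 2: tree reduction over the per-thread results.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            int o = threadIdx.x + stride;
            if (beats(s_val[o], s_idx[o], s_val[threadIdx.x], s_idx[threadIdx.x])) {
                s_val[threadIdx.x] = s_val[o];
                s_idx[threadIdx.x] = s_idx[o];
            }
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) out_idx[blockIdx.x] = s_idx[0];
}

// Host wrapper: global index of the maximum of each segment.
std::vector<int> segment_argmax(const std::vector<int>& host, int seg_len)
{
    int n = static_cast<int>(host.size());
    int blocks = (n + seg_len - 1) / seg_len;
    std::vector<int> result(blocks);
    if (blocks == 0) return result;

    int *d_data = nullptr, *d_idx = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_idx, blocks * sizeof(int));
    cudaMemcpy(d_data, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    segment_argmax_kernel<<<blocks, kThreads>>>(d_data, n, seg_len, d_idx);

    cudaMemcpy(result.data(), d_idx, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_idx);
    return result;
}
```

For the question's layout, segment_argmax(values, 10) returns one index per block of ten elements. Finding the maximum of the whole array would take a second pass over the per-segment winners, either on the host or with another kernel launch.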

硪扪都還晓 2024-11-09 11:08:26

One thing to watch out for when doing a max-value-plus-index reduction is that if the array contains more than one element with the maximum value (in your example, two or more values equal to 56), the index returned is not unique and may differ on every run of the code, because the order in which threads execute on the GPU is not deterministic.

To get around this kind of problem you can use a unique ordering index such as threadid + threadsperblock * blockid, or else the element index itself if that is unique. The max test is then along these lines:

// take the candidate if it is larger, or equal with a higher ordering index
if (a > max_so_far || (a == max_so_far && order_a > order_max_so_far))
{
    max_so_far = a;
    index_max_so_far = index_a;
    order_max_so_far = order_a;
}

(index and order can be the same variable, depending on the application.)
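
As an illustration of the same idea with Thrust, here is a sketch of a reduction operator that uses the element index as the ordering key. Note one deliberate difference from the test above: this version sends ties to the lower index (the first occurrence), whereas the snippet above prefers the higher order; flip the `ia < ib` comparison to match it. The names ValueIndex, ArgMaxLowestIndex and argmax_first are invented for the example.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <climits>

typedef thrust::tuple<int, int> ValueIndex;  // (value, element index)

// Larger value wins; equal values go to the lower index. Because the rule is
// a total order on (value, index) pairs, the result does not depend on the
// order in which the pairs are combined.
struct ArgMaxLowestIndex {
    __host__ __device__
    ValueIndex operator()(const ValueIndex& a, const ValueIndex& b) const
    {
        int va = thrust::get<0>(a), vb = thrust::get<0>(b);
        int ia = thrust::get<1>(a), ib = thrust::get<1>(b);
        if (va > vb || (va == vb && ia < ib)) return a;
        return b;
    }
};

// Index of the maximum of a non-empty device vector, first occurrence on ties.
int argmax_first(const thrust::device_vector<int>& d)
{
    thrust::counting_iterator<int> idx(0);
    ValueIndex init(INT_MIN, INT_MAX);  // loses against every real element
    ValueIndex best = thrust::reduce(
        thrust::make_zip_iterator(thrust::make_tuple(d.begin(), idx)),
        thrust::make_zip_iterator(thrust::make_tuple(d.end(), idx + d.size())),
        init, ArgMaxLowestIndex());
    return thrust::get<1>(best);
}
```

With values 5 56 2 56 4, this should return index 1 on every run, whichever order the backend combines the pairs in.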
