Differences in CUDA block synchronization between a GTS 250 and Fermi devices
So I've been working on a program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS250, which is a Compute 1.1 device. However, on a Compute 2.0 device (C2050 or C2070) the hash table is corrupt (the data is incorrect and pointers are sometimes wrong).
Basically, the code works fine when only one block is used (on both devices). However, when 2 or more blocks are used, it works only on the GTS250 and not on any Fermi device.
I understand that the warp scheduling and memory architectures of the two platforms are different, and I took that into account while developing the code. From my understanding, using __threadfence() should ensure that any global writes are committed and visible to other blocks; however, judging from the corrupt hash table, it appears that they are not.
I've also posted the problem on the NVIDIA CUDA developer forum; it can be found here.
The relevant code is below:
__device__ void lock(int *mutex) {
    // Spin until the mutex is acquired (0 -> 1)
    while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex) {
    // Release the mutex
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);
    lock(&globalHashLocks[hashValue]);
    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;
    bool found = false;
    Entry currentEntry;
    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }
    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence();
        unlock(globalFreeLock);
        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;
        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }
    __threadfence();
    unlock(&globalHashLocks[hashValue]);
}
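The Entry and Table types are not shown in the post. A plausible layout, inferred purely from how the code above uses them (data.x/y/z/w, next, count, entries, pool), would be something like:

struct Entry {
    uint4 data;           // x = key, y = count, z = sum, w = sumSquared
    int   next;           // pool index of the next entry in the bucket, -1 ends the chain
};

struct Table {
    unsigned int count;   // number of buckets, passed to hash()
    int         *entries; // per-bucket head index into pool, -1 for an empty bucket
    Entry       *pool;    // preallocated entry storage in global memory
};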
Comments (2)
As pointed out by LSChien in this post, the issue is with L1 cache coherency. While using __threadfence() will guarantee that shared and global memory writes are visible to other threads, since it is not atomic, thread x in block 1 may read a cached memory value until thread y in block 0 has executed the threadfence instruction. Instead, LSChien suggested a hack in his post: using an atomicCAS() to force the thread to read from global memory instead of a cached value. The proper way to do this is by declaring the memory as volatile, requiring that every write to that memory be immediately visible to all other threads in the grid.
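A minimal sketch of the volatile approach described above, using the Entry layout sketched after the question's code (the helper name and signature are made up for illustration; this is not the poster's actual fix):

// Walk a bucket through volatile-qualified pointers so every load is issued to
// L2/global memory instead of being served from a possibly stale L1 line.
// Entry is assumed to be { uint4 data; int next; } as sketched earlier.
__device__ int find_in_bucket(volatile Entry *pool, volatile int *entries,
                              unsigned int hashValue, unsigned int key)
{
    int loc = entries[hashValue];        // volatile load of the bucket head
    while (loc != -1) {
        if (pool[loc].data.x == key)     // volatile load of the stored key
            return loc;
        loc = pool[loc].next;            // volatile load of the next index
    }
    return -1;                           // key not present in this bucket
}

The atomicCAS() hack works for a similar reason: on Fermi, atomic operations are performed at L2, so they observe the most recent globally visible value rather than whatever happens to sit in the SM's L1.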
__threadfence guarantees that writes to global memory are visible to other threads in the current block before returning. That is not the same as "the write to global memory is complete"! Think of the caching on each multiprocessor.
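To illustrate the distinction being drawn here, a small hypothetical producer/consumer sketch (flag and payload are made-up globals, not part of the question): the fence only orders the writer's stores; a reader on another SM still has to bypass its own L1 with an atomic or a volatile load to see them.

__device__ int payload;      // hypothetical data word published by one block
__device__ int flag = 0;     // hypothetical ready flag

__device__ void producer(void)
{
    payload = 42;            // write the data
    __threadfence();         // order the payload store before the flag store, device-wide
    atomicExch(&flag, 1);    // publish; the atomic itself is performed at L2
}

__device__ void consumer(void)
{
    // Polling with an atomic (or a volatile load) is what forces each read past
    // the SM-local L1; the producer's fence alone does not refresh a line that
    // is already cached on this SM.
    while (atomicAdd(&flag, 0) == 0) { }
    int v = *((volatile int *)&payload);   // volatile load of the published value
    (void)v;                               // ... use v
}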