Differences in CUDA block synchronization between a GTS 250 and Fermi devices
So I've been working on a program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS250, which is a Compute 1.1 device. However, on a Compute 2.0 device (C2050 or C2070) the hash table is corrupt (the data is incorrect and pointers are sometimes wrong).
Basically, the code works fine when only one block is used (on both devices). However, when 2 or more blocks are used, it works only on the GTS250 and not on any Fermi device.
I understand that the warp scheduling and memory architectures of the two platforms are different, and I took that into account while developing the code. From my understanding, using __threadfence() should ensure that any global writes are committed and visible to other blocks; however, judging from the corrupt hash table, it appears that they are not.
I've also posted the problem on the NVIDIA CUDA developer forum; it can be found here.
The relevant code is below:
__device__ void lock(int *mutex) {
    // Spin until the mutex is acquired (0 -> 1)
    while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex) {
    // Release the mutex
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);
    lock(&globalHashLocks[hashValue]);
    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;
    bool found = false;
    Entry currentEntry;
    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }
    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence();
        unlock(globalFreeLock);
        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;
        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }
    __threadfence();
    unlock(&globalHashLocks[hashValue]);
}
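The Entry and Table types are not shown in the post. A plausible layout, inferred purely from how the code above uses them (data.x/y/z/w, next, count, entries, pool), would be something like:

struct Entry {
    uint4 data;           // x = key, y = count, z = sum, w = sumSquared
    int   next;           // pool index of the next entry in the bucket, -1 ends the chain
};

struct Table {
    unsigned int count;   // number of buckets, passed to hash()
    int         *entries; // per-bucket head index into pool, -1 for an empty bucket
    Entry       *pool;    // preallocated entry storage in global memory
};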
Comments (2)
As pointed out by LSChien in this post, the issue is with L1 cache coherency. While using __threadfence() will guarantee that shared and global memory writes are visible to other threads, since it is not atomic, thread x in block 1 may read a cached memory value until thread y in block 0 has executed the threadfence instruction. Instead, LSChien suggested a hack in his post: using an atomicCAS() to force the thread to read from global memory instead of a cached value. The proper way to do this is by declaring the memory as volatile, requiring that every write to that memory be immediately visible to all other threads in the grid.
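A minimal sketch of the volatile approach described above, using the Entry layout sketched after the question's code (the helper name and signature are made up for illustration; this is not the poster's actual fix):

// Walk a bucket through volatile-qualified pointers so every load is issued to
// L2/global memory instead of being served from a possibly stale L1 line.
// Entry is assumed to be { uint4 data; int next; } as sketched earlier.
__device__ int find_in_bucket(volatile Entry *pool, volatile int *entries,
                              unsigned int hashValue, unsigned int key)
{
    int loc = entries[hashValue];        // volatile load of the bucket head
    while (loc != -1) {
        if (pool[loc].data.x == key)     // volatile load of the stored key
            return loc;
        loc = pool[loc].next;            // volatile load of the next index
    }
    return -1;                           // key not present in this bucket
}

The atomicCAS() hack works for a similar reason: on Fermi, atomic operations are performed at L2, so they observe the most recent globally visible value rather than whatever happens to sit in the SM's L1.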
__threadfence guarantees that writes to global memory are visible to other threads in the current block before returning. That is not the same as "the write to global memory is complete"! Think of the caching on each multiprocessor.
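To illustrate the distinction being drawn here, a small hypothetical producer/consumer sketch (flag and payload are made-up globals, not part of the question): the fence only orders the writer's stores; a reader on another SM still has to bypass its own L1 with an atomic or a volatile load to see them.

__device__ int payload;      // hypothetical data word published by one block
__device__ int flag = 0;     // hypothetical ready flag

__device__ void producer(void)
{
    payload = 42;            // write the data
    __threadfence();         // order the payload store before the flag store, device-wide
    atomicExch(&flag, 1);    // publish; the atomic itself is performed at L2
}

__device__ void consumer(void)
{
    // Polling with an atomic (or a volatile load) is what forces each read past
    // the SM-local L1; the producer's fence alone does not refresh a line that
    // is already cached on this SM.
    while (atomicAdd(&flag, 0) == 0) { }
    int v = *((volatile int *)&payload);   // volatile load of the published value
    (void)v;                               // ... use v
}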