CUDA kernel gives different results on 2 different GPUs (GeForce 8600M GT vs. Quadro FX 770M)



I've been working on an AES CUDA application, and I have a kernel which performs ECB encryption on the GPU. To verify that the algorithm's logic is not altered when running in parallel, I send a known input test vector provided by NIST and then, from host code, compare the output against the known NIST output vector with an assert.
I have run this test on my NVIDIA GPU, an 8600M GT, under Windows 7 with driver version 3.0. In this scenario everything works perfectly and the assert succeeds.

Now, when the same application is run on a Quadro FX 770M (under Linux, with the same driver version), the same test vectors are sent, but the result obtained is incorrect and the assert fails.
The kernel is executed by 256 threads. Within the kernel, precomputed lookup tables of 256 elements are used to avoid arithmetic. These tables initially reside in global memory; each of the 256 threads launched with the kernel cooperates by loading one element of the lookup table and moving it into a new lookup table in shared memory, so that access latency is reduced.

Originally, I suspected synchronization problems due to clock-speed differences between the GPUs: perhaps threads were using values not yet loaded into shared memory, or values that had somehow not been processed yet, corrupting the output and making it incorrect.
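
For illustration, the loading pattern described above corresponds roughly to the following sketch; the names d_sbox, s_sbox and load_table_sketch are placeholders for this post, not the actual code:

__device__ unsigned char d_sbox[256];                 // precomputed table in global memory (placeholder name)

__global__ void load_table_sketch()
{
    __shared__ unsigned char s_sbox[256];             // per-block copy in shared memory

    int tid = threadIdx.y * blockDim.x + threadIdx.x; // 0..255 for the 4x64 block used here

    s_sbox[tid] = d_sbox[tid];                        // each thread copies exactly one table entry

    // Without this barrier, a thread may read an entry that another thread
    // has not written yet -- exactly the kind of race suspected above.
    __syncthreads();

    // ... table lookups using s_sbox ...
}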

Here the known test vectors are declared; they are passed to AES_set_encrption, which is in charge of setting up the kernel.

void test_vectors ()
{
    // NIST SP 800-38A ECB-AES256 test vector
    unsigned char testPlainText[]  = {0x6b, 0xc1, 0xbe, 0xe2, 0x2e, 0x40, 0x9f, 0x96, 0xe9, 0x3d, 0x7e, 0x11, 0x73, 0x93, 0x17, 0x2a};
    // 256-bit key (32 bytes)
    unsigned char testKeyText[]    = {0x60, 0x3d, 0xeb, 0x10, 0x15, 0xca, 0x71, 0xbe, 0x2b, 0x73, 0xae, 0xf0, 0x85, 0x7d, 0x77, 0x81,
                                      0x1f, 0x35, 0x2c, 0x07, 0x3b, 0x61, 0x08, 0xd7, 0x2d, 0x98, 0x10, 0xa3, 0x09, 0x14, 0xdf, 0xf4};
    unsigned char testCipherText[] = {0xf3, 0xee, 0xd1, 0xbd, 0xb5, 0xd2, 0xa0, 0x3c, 0x06, 0x4b, 0x5a, 0x7e, 0x3d, 0xb1, 0x81, 0xf8};

    unsigned char out[16] = {0x0};

    //AES Encryption
    AES_set_encrption( testPlainText, out, 16, (u32*)testKeyText);

    //Display encrypted data (%x drops leading zeros, so a byte such as 0x06 prints as "6")
    printf("\n  GPU Encryption: ");
    for (int i = 0; i < AES_BLOCK_SIZE; i++)
        printf("%x", out[i]);

    //Assert that the encrypted output is the same as the NIST testCipherText vector
    assert (memcmp (out, testCipherText, 16) == 0);
}

Here the setup function is in charge of allocating memory, calling the kernel, and sending the results back to the host. Notice that I synchronize before copying back to the host, so at that point everything should be finished, which makes me think the problem is within the kernel.

__host__ double AES_set_encrption (... *input_data, ... *output_data, ... input_length, ... ckey )
{
    //Allocate memory on the device and copy the input buffer from the host to the GPU
    CUDA_SAFE_CALL( cudaMalloc( (void **) &d_input_data, input_length ) );
    CUDA_SAFE_CALL( cudaMemcpy( (void*)d_input_data, (void*)input_data, input_length, cudaMemcpyHostToDevice ) );

    dim3 dimGrid(1);
    dim3 dimBlock(THREAD_X, THREAD_Y); // THREAD_X = 4 & THREAD_Y = 64, i.e. 256 threads per block
    AES_encrypt<<<dimGrid, dimBlock>>>(d_input_data);

    cudaThreadSynchronize();

    //Copy the data processed by the GPU back to the host
    cudaMemcpy(output_data, d_input_data, input_length, cudaMemcpyDeviceToHost);

    //Free CUDA resources
    CUDA_SAFE_CALL( cudaFree(d_input_data) );
}
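
One detail worth noting, though it is not part of the original code: neither the kernel launch nor cudaThreadSynchronize() has its return status checked, so if the launch failed on one of the machines, the stale contents of d_input_data would simply be copied back. A minimal sketch of the checks that could be added around the launch:

    AES_encrypt<<<dimGrid, dimBlock>>>(d_input_data);

    // Catch launch/configuration errors first, then errors raised while the kernel ran.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "AES_encrypt failed: %s\n", cudaGetErrorString(err));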

Finally, within the kernel a set of AES rounds is computed. Since I suspected the synchronization problem was inside the kernel, I placed __syncthreads() after each round and each computational operation to make sure all threads were moving in lockstep, so that no uncomputed values could be read. But that still did not solve the problem.
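
The kernel itself is not shown in the post; based on the description it has roughly the following shape (NUM_ROUNDS, g_table and do_round are placeholders standing in for the real round logic, not the actual code):

#define NUM_ROUNDS 14                          // AES-256 uses 14 rounds

__device__ unsigned char g_table[256];         // placeholder for the precomputed lookup tables

__device__ void do_round(unsigned char *state, const unsigned char *table, int round)
{
    // placeholder: SubBytes / ShiftRows / MixColumns / AddRoundKey via table lookups
}

__global__ void AES_encrypt(unsigned char *state)
{
    __shared__ unsigned char s_table[256];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    s_table[tid] = g_table[tid];               // cooperative load, as in the earlier sketch
    __syncthreads();                           // table fully populated before any lookup

    for (int round = 0; round < NUM_ROUNDS; ++round)
    {
        do_round(state, s_table, round);       // the actual round computation goes here
        __syncthreads();                       // barrier after every round, as described above
    }
}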

Here is the output when I use the 8600M GT GPU which works fine:

AES 256 Bit key

NIST Test Vectors:

PlaintText: 6bc1bee22e409f96e93d7e117393172a

Key: 603deb1015ca71be2b73aef0857d7781

CipherText: f3eed1bdb5d2a03c64b5a7e3db181f8

GPU Encryption: f3eed1bdb5d2a03c64b5a7e3db181f8

Test status: Passed

And here is the output when I use the Quadro FX 770M, where it fails:

AES 256 Bit key
NIST Test Vectors:

PlaintText: 6bc1bee22e409f96e93d7e117393172a

Key: 603deb1015ca71be2b73aef0857d7781

CipherText: f3eed1bdb5d2a03c64b5a7e3db181f8

GPU Encryption: c837204eb4c1063ed79c77946893b0

Generic assert memcmp (out, testCipherText, 16) == 0 has thrown an error

Test status: Failed

What might be the reason that two GPUs compute different results even though they run the same kernel?
I would appreciate any hint, troubleshooting suggestion, or step any of you could give me to fix this issue.

Thanks in advance!!


Comments (1)

烂人 2024-10-05 10:36:10


Disclaimer: I don't know anything about AES encryption.

Do you use double precision? You are probably aware, but just to be sure: I believe that both of the cards you are using are compute capability 1.1, which does not support double precision. Perhaps the cards or the platforms convert to single precision in different ways? Does anyone know? Truthfully, the IEEE floating-point deviations are well specified, so I'd be surprised.
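
For what it's worth, the compute capability of each card (and hence whether hardware double precision is available at all) can be checked with the standard CUDA runtime API, roughly like this:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                               // query device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // Hardware double precision requires compute capability 1.3 or higher.
    return 0;
}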
