生成 CUDA device 内核内的 char 数组的所有组合

发布于 2024-08-13 01:48:04 字数 3865 浏览 6 评论 0原文

我需要帮助。我开始使用 CUDA (2.3 / 3.0beta) 编写一个常见的暴力破解器/密码猜测器。我尝试了不同的方法来生成定义的 ASCII 字符集的所有可能的纯文本“候选者”。

在此示例代码中，我想生成所有 74^4 种可能的组合（并将结果输出回主机/标准输出）。

$ ./combinations
Total number of combinations    : 29986576

Maximum output length   : 4
ASCII charset length    : 74

ASCII charset   : 0x30 - 0x7a
 "0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy"

CUDA 代码（使用 2.3 和 3.0b 编译 - sm_10）-combinaions.cu：

#include <stdio.h>
#include <cuda.h>

__device__ uchar4 charset_global = {0x30, 0x30, 0x30, 0x30};
__shared__ __device__ uchar4 charset[128];

__global__ void combo_kernel(uchar4 * result_d, unsigned int N)
{
 int totalThreads = blockDim.x * gridDim.x ;
 int tasksPerThread = (N % totalThreads) == 0 ? N / totalThreads : N/totalThreads + 1;
 int myThreadIdx = blockIdx.x * blockDim.x + threadIdx.x ;
 int endIdx = myThreadIdx + totalThreads * tasksPerThread ;
 if( endIdx > N) endIdx = N;

 const unsigned int m = 74 + 0x30;

 for(int idx = myThreadIdx ; idx < endIdx ; idx += totalThreads) {
  charset[threadIdx.x].x = charset_global.x;
  charset[threadIdx.x].y = charset_global.y;
  charset[threadIdx.x].z = charset_global.z;
  charset[threadIdx.x].w = charset_global.w;
  __threadfence();

  if(charset[threadIdx.x].x < m) {
   charset[threadIdx.x].x++;

  } else if(charset[threadIdx.x].y < m) {
   charset[threadIdx.x].x = 0x30; // = 0
   charset[threadIdx.x].y++;

  } else if(charset[threadIdx.x].z < m) {
   charset[threadIdx.x].y = 0x30; // = 0
   charset[threadIdx.x].z++;

  } else if(charset[threadIdx.x].w < m) {
   charset[threadIdx.x].z = 0x30;
   charset[threadIdx.x].w++;; // = 0
  }

  charset_global.x = charset[threadIdx.x].x;
  charset_global.y = charset[threadIdx.x].y;
  charset_global.z = charset[threadIdx.x].z;
  charset_global.w = charset[threadIdx.x].w;

  result_d[idx].x = charset_global.x;
  result_d[idx].y = charset_global.y;
  result_d[idx].z = charset_global.z;
  result_d[idx].w = charset_global.w;
 }
}

#define BLOCKS 65535
#define THREADS 128

int main(int argc, char **argv)
{
 const int ascii_chars = 74;
 const int max_len = 4;
 const unsigned int N = pow((float)ascii_chars, max_len);
 size_t size = N * sizeof(uchar4);

 uchar4 *result_d, *result_h;
 result_h = (uchar4 *)malloc(size );
 cudaMalloc((void **)&result_d, size );
 cudaMemset(result_d, 0, size);

 printf("Total number of combinations\t: %d\n\n", N); 
 printf("Maximum output length\t: %d\n", max_len);
 printf("ASCII charset length\t: %d\n\n", ascii_chars);

 printf("ASCII charset\t: 0x30 - 0x%02x\n ", 0x30 + ascii_chars);
 for(int i=0; i < ascii_chars; i++)
  printf("%c",i + 0x30);
 printf("\n\n");

 combo_kernel <<< BLOCKS, THREADS >>> (result_d, N);
 cudaThreadSynchronize();

 printf("CUDA kernel done\n");
 printf("hit key to continue...\n");
 getchar();

 cudaMemcpy(result_h, result_d, size, cudaMemcpyDeviceToHost);

 for (unsigned int i=0; i<N; i++)
  printf("result[%06u]\t%c%c%c%c\n",i, result_h[i].x, result_h[i].y, result_h[i].z, result_h[i].w);

 free(result_h);
 cudaFree(result_d);
}

代码应该编译没有任何问题，但输出不是我所期望的。

在模拟模式上：

CUDA kernel done hit
key to continue...

    result[000000]  1000 
...
    result[000128]  5000

在发布模式上：

CUDA kernel done hit
key to continue...

    result[000000]  1000 
...
    result[012288]  5000

我还在代码的不同行上使用了 __threadfence() 和/或 __syncthreads() 也没有成功...

ps。如果可能的话，我想生成内核函数内的所有内容。我还尝试在 host main 函数和 memcpy 到 device 中“预”生成可能的纯文本候选，这只适用于非常有限的字符集大小（因为设备内存有限））。

关于输出的任何想法，为什么重复（即使使用 __threadfence() 或 __syncthreads()）？
关于
任何其他方法可以在 CUDA 内核中快速生成纯文本（候选）:-) (~75^8)？

谢谢一百万

问候一月

原文

I need help please. I started to program a common brute forcer / password guesser with CUDA (2.3 / 3.0beta).
I tried different ways to generate all possible plain text "candidates" of a defined ASCII char set.

In this sample code I want to generate all 74^4 possible combinations (and just output the result back to host/stdout).

$ ./combinations
Total number of combinations    : 29986576

Maximum output length   : 4
ASCII charset length    : 74

ASCII charset   : 0x30 - 0x7a
 "0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy"

CUDA code (compiled with 2.3 and 3.0b - sm_10) - combinaions.cu:

#include <stdio.h>
#include <cuda.h>

__device__ uchar4 charset_global = {0x30, 0x30, 0x30, 0x30};
__shared__ __device__ uchar4 charset[128];

__global__ void combo_kernel(uchar4 * result_d, unsigned int N)
{
 int totalThreads = blockDim.x * gridDim.x ;
 int tasksPerThread = (N % totalThreads) == 0 ? N / totalThreads : N/totalThreads + 1;
 int myThreadIdx = blockIdx.x * blockDim.x + threadIdx.x ;
 int endIdx = myThreadIdx + totalThreads * tasksPerThread ;
 if( endIdx > N) endIdx = N;

 const unsigned int m = 74 + 0x30;

 for(int idx = myThreadIdx ; idx < endIdx ; idx += totalThreads) {
  charset[threadIdx.x].x = charset_global.x;
  charset[threadIdx.x].y = charset_global.y;
  charset[threadIdx.x].z = charset_global.z;
  charset[threadIdx.x].w = charset_global.w;
  __threadfence();

  if(charset[threadIdx.x].x < m) {
   charset[threadIdx.x].x++;

  } else if(charset[threadIdx.x].y < m) {
   charset[threadIdx.x].x = 0x30; // = 0
   charset[threadIdx.x].y++;

  } else if(charset[threadIdx.x].z < m) {
   charset[threadIdx.x].y = 0x30; // = 0
   charset[threadIdx.x].z++;

  } else if(charset[threadIdx.x].w < m) {
   charset[threadIdx.x].z = 0x30;
   charset[threadIdx.x].w++;; // = 0
  }

  charset_global.x = charset[threadIdx.x].x;
  charset_global.y = charset[threadIdx.x].y;
  charset_global.z = charset[threadIdx.x].z;
  charset_global.w = charset[threadIdx.x].w;

  result_d[idx].x = charset_global.x;
  result_d[idx].y = charset_global.y;
  result_d[idx].z = charset_global.z;
  result_d[idx].w = charset_global.w;
 }
}

#define BLOCKS 65535
#define THREADS 128

int main(int argc, char **argv)
{
 const int ascii_chars = 74;
 const int max_len = 4;
 const unsigned int N = pow((float)ascii_chars, max_len);
 size_t size = N * sizeof(uchar4);

 uchar4 *result_d, *result_h;
 result_h = (uchar4 *)malloc(size );
 cudaMalloc((void **)&result_d, size );
 cudaMemset(result_d, 0, size);

 printf("Total number of combinations\t: %d\n\n", N); 
 printf("Maximum output length\t: %d\n", max_len);
 printf("ASCII charset length\t: %d\n\n", ascii_chars);

 printf("ASCII charset\t: 0x30 - 0x%02x\n ", 0x30 + ascii_chars);
 for(int i=0; i < ascii_chars; i++)
  printf("%c",i + 0x30);
 printf("\n\n");

 combo_kernel <<< BLOCKS, THREADS >>> (result_d, N);
 cudaThreadSynchronize();

 printf("CUDA kernel done\n");
 printf("hit key to continue...\n");
 getchar();

 cudaMemcpy(result_h, result_d, size, cudaMemcpyDeviceToHost);

 for (unsigned int i=0; i<N; i++)
  printf("result[%06u]\t%c%c%c%c\n",i, result_h[i].x, result_h[i].y, result_h[i].z, result_h[i].w);

 free(result_h);
 cudaFree(result_d);
}

The code should compile without any problems but the output is not what i expected.

On emulation mode:

CUDA kernel done hit
key to continue...

    result[000000]  1000 
...
    result[000128]  5000

On release mode:

CUDA kernel done hit
key to continue...

    result[000000]  1000 
...
    result[012288]  5000

I also used __threadfence() and or __syncthreads() on different lines of the code also without success...

ps. if possible I want to generate everything inside of the kernel function . I also tried "pre" generating of possible plain text candidates inside host main function and memcpy to device, this works only with a very limited charset size (because of limited device memory).

any idea about the output, why the repeating (even with __threadfence() or __syncthreads()) ?
any other method to generate plain text (candidates) inside CUDA kernel fast :-) (~75^8) ?

thanks a million

greets jan

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

海之角 2024-08-20 01:48:04

顺便说一句，您的循环界限过于复杂。您不需要执行所有这些工作来计算 endIdx，而是可以执行以下操作，使代码更简单。

for(int idx = myThreadIdx ; idx < N ; idx += totalThreads)

Incidentally, your loop bound is overly complex. You don't need to do all that work to compute the endIdx, instead you can do the following, making the code simpler.

for(int idx = myThreadIdx ; idx < N ; idx += totalThreads)

回复收藏 0 原文

玩世 2024-08-20 01:48:04

让我们看看：

填充字符集数组时，__syncthreads() 就足够了，因为您对写入全局内存不感兴趣（稍后会详细介绍）
您的 if 语句不是正确重置循环迭代器：
- 在 z < m，则 x == m 和 y == m 都必须设置为 0。
- 与 w 类似
每个线程负责写入一组charset 中有 4 个字符，但每个线程都写入相同的 4 个值。没有线程做任何独立的工作。
您正在将每个线程结果写入全局内存，而没有原子操作，这是不安全的。无法保证结果在读回之前不会立即被另一个线程破坏。
将计算结果写入全局内存后，您将立即从全局内存中读回计算结果。目前还不清楚你为什么要这样做，而且这是非常不安全的。
最后，CUDA 中没有可靠的方法来实现所有块之间的同步，这似乎正是您所希望的。调用 __threadfence 仅适用于当前在设备上执行的块，该块可以是应为内核调用运行的所有块的子集。因此它不能作为同步原语工作。

计算每个线程的 x、y、z 和 w 的初始值可能更容易。然后每个线程可以从其初始值开始循环，直到执行完tasksPerThread 迭代。写出这些值可能或多或少可以像您现在所拥有的那样进行。

编辑：这是一个简单的测试程序，用于演示循环迭代中的逻辑错误：

int m = 2;
int x = 0, y = 0, z = 0, w = 0;

for (int i = 0; i < m * m * m * m; i++)
{
    printf("x: %d y: %d z: %d w: %d\n", x, y, z, w);
    if(x < m) {
        x++;
    } else if(y < m) {
        x = 0; // = 0
        y++;
    } else if(z < m) {
        y = 0; // = 0
        z++;
    } else if(w < m) {
        z = 0;
        w++;; // = 0
    }
}

其输出如下：

x: 0 y: 0 z: 0 w: 0
x: 1 y: 0 z: 0 w: 0
x: 2 y: 0 z: 0 w: 0
x: 0 y: 1 z: 0 w: 0
x: 1 y: 1 z: 0 w: 0
x: 2 y: 1 z: 0 w: 0
x: 0 y: 2 z: 0 w: 0
x: 1 y: 2 z: 0 w: 0
x: 2 y: 2 z: 0 w: 0
x: 2 y: 0 z: 1 w: 0
x: 0 y: 1 z: 1 w: 0
x: 1 y: 1 z: 1 w: 0
x: 2 y: 1 z: 1 w: 0
x: 0 y: 2 z: 1 w: 0
x: 1 y: 2 z: 1 w: 0
x: 2 y: 2 z: 1 w: 0

Let's see:

When filling your charset array, __syncthreads() will be sufficient as you are not interested in writes to global memory (more on this later)
Your if statements are not correctly resetting your loop iterators:
- In z < m, then both x == m and y == m and must both be set to 0.
- Similar for w
Each thread is responsible for writing one set of 4 characters in charset, but every thread writes the same 4 values. No thread does any independent work.
You are writing each threads results to global memory without atomics, which is unsafe. There is no guarantee that the results won't be immediately clobbered by another thread before reading them back.
You are reading the results of computation back from global memory immediately after writing them to global memory. It's unclear why you are doing this and this is very unsafe.
Finally, there is no reliable way in CUDA to to a synchronization between all blocks, which seems to be what you are hoping for. Calling __threadfence only applies to blocks currently executing on the device, which can be subset of all blocks that should run for a kernel call. Thus it doesn't work as a synchronization primitive.

It's probably easier to calculate initial values of x, y, z and w for each thread. Then each thread can start looping from its initial values until it has performed tasksPerThread iterations. Writing the values out can probably proceed more or less as you have it now.

EDIT: Here is a simple test program to demonstrate the logic errors in your loop iteration:

int m = 2;
int x = 0, y = 0, z = 0, w = 0;

for (int i = 0; i < m * m * m * m; i++)
{
    printf("x: %d y: %d z: %d w: %d\n", x, y, z, w);
    if(x < m) {
        x++;
    } else if(y < m) {
        x = 0; // = 0
        y++;
    } else if(z < m) {
        y = 0; // = 0
        z++;
    } else if(w < m) {
        z = 0;
        w++;; // = 0
    }
}

The output of which is this:

x: 0 y: 0 z: 0 w: 0
x: 1 y: 0 z: 0 w: 0
x: 2 y: 0 z: 0 w: 0
x: 0 y: 1 z: 0 w: 0
x: 1 y: 1 z: 0 w: 0
x: 2 y: 1 z: 0 w: 0
x: 0 y: 2 z: 0 w: 0
x: 1 y: 2 z: 0 w: 0
x: 2 y: 2 z: 0 w: 0
x: 2 y: 0 z: 1 w: 0
x: 0 y: 1 z: 1 w: 0
x: 1 y: 1 z: 1 w: 0
x: 2 y: 1 z: 1 w: 0
x: 0 y: 2 z: 1 w: 0
x: 1 y: 2 z: 1 w: 0
x: 2 y: 2 z: 1 w: 0

回复收藏 0 原文

~没有更多了~