当前位置：文江博客话题详情

我们可以并行化这个任务吗？

发布于 2024-12-15 05:44:48 字数 185 浏览 2 评论 0原文

给定一个 C 字符串（以 NULL 字符常量结尾的字符数组），我们必须找到该字符串的长度。您能否建议一些方法来并行化 N 个执行线程。我在划分子问题时遇到问题，因为访问不存在的数组位置会产生分段错误。

编辑：我并不担心并行执行此任务可能会产生更大的开销。只是想知道这是否可以完成（使用 openmp 等）

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

会傲 2024-12-22 05:44:48

不，不能。因为每个步骤都需要知道之前的状态（我们是否在前一个字符上遇到了空值）。您一次只能安全地检查 1 个字符。

想象一下，您正在翻转岩石，并且必须停在下面有白色油漆的岩石处（空），否则您会死（又名分段故障等）。

你不能让人们互相“领先”，因为白色油漆岩石可能介于两者之间。

拥有多个人（线程/进程）只是让他们轮流翻开下一块石头。他们永远不会同时翻动石头。

回复收藏 0 原文

森林很绿却致人迷途 2024-12-22 05:44:48

这可能甚至不值得尝试。如果字符串很短，开销将大于处理速度的增益。如果字符串真的很长，速度可能会受到内存速度的限制，而不是受到CPU处理速度的限制。

回复收藏 0 原文

陈甜 2024-12-22 05:44:48

我想说仅使用标准 C 字符串这是无法完成的。但是，如果您可以定义一个包含与进程一样多的字符的个人终止字符串，那就很简单了。

回复收藏 0 原文

无力看清 2024-12-22 05:44:48

您知道该 char 数组的最大大小吗？如果是这样，您可以在不同的垃圾中进行并行搜索，并返回具有最小索引的终止符的索引。
因此，您只能处理分配的内存，不会出现段错误。

当然，这并不像 s_nairs 的回答那么复杂，但非常简单。
例子：

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int N=1000;
    char *str = calloc(N, sizeof(char));
    strcpy(str, "This is a test string!");  
    fprintf(stdout, "%s\n", str);

    int nthreads = omp_get_num_procs();
    int i;
    int ind[nthreads];
    for( i = 0; i < nthreads; i++){
        ind[i] = -1;
    }

    int procn;
    int flag;
#pragma omp parallel  private(procn, flag)
    {
        flag = 1;
        procn = omp_get_thread_num();
#pragma omp for
        for( i = 0; i < N; i++){
            if (str[i] == '\0' && flag == 1){
                ind[procn] = i;
                flag = 0;
            }
        }
    }
    int len = 0;
    for( i = 0; i < nthreads; i++){
        if(ind[i]>-1){
            len = ind[i];
            break;
        }
    }
    fprintf(stdout,"strlen %d\n", len);
    free(str);
    return 0;
}

Do you know the maximum size of that char array? If so, you could do a parallel search in different junks and return the index of the terminator with smallest index.
Hence you are then only working on allocated memory, you cannot get segfaults.

Of course this is not as sophisticated as s_nairs answer but pretty straight forward.
example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int N=1000;
    char *str = calloc(N, sizeof(char));
    strcpy(str, "This is a test string!");  
    fprintf(stdout, "%s\n", str);

    int nthreads = omp_get_num_procs();
    int i;
    int ind[nthreads];
    for( i = 0; i < nthreads; i++){
        ind[i] = -1;
    }

    int procn;
    int flag;
#pragma omp parallel  private(procn, flag)
    {
        flag = 1;
        procn = omp_get_thread_num();
#pragma omp for
        for( i = 0; i < N; i++){
            if (str[i] == '\0' && flag == 1){
                ind[procn] = i;
                flag = 0;
            }
        }
    }
    int len = 0;
    for( i = 0; i < nthreads; i++){
        if(ind[i]>-1){
            len = ind[i];
            break;
        }
    }
    fprintf(stdout,"strlen %d\n", len);
    free(str);
    return 0;
}

回复收藏 0 原文

烟凡古楼 2024-12-22 05:44:48

您可以在 Windows 中做一些丑陋的事情，将不安全的内存读取包含在 SEH __try 块中：

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 2

DWORD WINAPI FindZeroThread(LPVOID lpParameter)
{
  const char* volatile* pp = (const char* volatile*)lpParameter;

  __try
  {
    while (**pp)
    {
      (*pp) += N;
    }
  }
  __except (EXCEPTION_EXECUTE_HANDLER)
  {
    *pp = NULL;
  }

  return 0;
}

size_t pstrlen(const char* s)
{
  int i;
  HANDLE handles[N];
  const char* volatile ptrs[N];
  const char* p = (const char*)(UINT_PTR)-1;

  for (i = 0; i < N; i++)
  {
    ptrs[i] = s + i;
    handles[i] = CreateThread(NULL, 0, &FindZeroThread, (LPVOID)&ptrs[i], 0, NULL);
  }

  WaitForMultipleObjects(N, handles, TRUE /* bWaitAll */, INFINITE);

  for (i = 0; i < N; i++)
  {
    CloseHandle(handles[i]);
    if (ptrs[i] && p > ptrs[i]) p = ptrs[i];
  }

  return (size_t)(p - s);
}

#define LEN (20 * 1000 * 1000)

int main(void)
{
  char* s = malloc(LEN);

  memset(s, '*', LEN);
  s[LEN - 1] = 0;

  printf("strlen()=%zu pstrlen()=%zu\n", strlen(s), pstrlen(s));

  return 0;
}

输出：

strlen()=19999999 pstrlen()=19999999

我认为使用 MMX/SSE 指令以某种并行方式加速代码可能会更好。

编辑：这在 Windows 上可能不是一个好主意，请参阅 Raymond Chen 的
IsBadXxxPtr 实际上应该被称为 CrashProgramRandomly。

You could do something ugly like this in Windows enclosing unsafe memory reads in a SEH __try block:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 2

DWORD WINAPI FindZeroThread(LPVOID lpParameter)
{
  const char* volatile* pp = (const char* volatile*)lpParameter;

  __try
  {
    while (**pp)
    {
      (*pp) += N;
    }
  }
  __except (EXCEPTION_EXECUTE_HANDLER)
  {
    *pp = NULL;
  }

  return 0;
}

size_t pstrlen(const char* s)
{
  int i;
  HANDLE handles[N];
  const char* volatile ptrs[N];
  const char* p = (const char*)(UINT_PTR)-1;

  for (i = 0; i < N; i++)
  {
    ptrs[i] = s + i;
    handles[i] = CreateThread(NULL, 0, &FindZeroThread, (LPVOID)&ptrs[i], 0, NULL);
  }

  WaitForMultipleObjects(N, handles, TRUE /* bWaitAll */, INFINITE);

  for (i = 0; i < N; i++)
  {
    CloseHandle(handles[i]);
    if (ptrs[i] && p > ptrs[i]) p = ptrs[i];
  }

  return (size_t)(p - s);
}

#define LEN (20 * 1000 * 1000)

int main(void)
{
  char* s = malloc(LEN);

  memset(s, '*', LEN);
  s[LEN - 1] = 0;

  printf("strlen()=%zu pstrlen()=%zu\n", strlen(s), pstrlen(s));

  return 0;
}

Output:

strlen()=19999999 pstrlen()=19999999

I think it may be better to use MMX/SSE instructions to speed up the code in a somewhat parallel way.

EDIT: This may be not a very good idea on Windows after all, see Raymond Chen's
IsBadXxxPtr should really be called CrashProgramRandomly.

回复收藏 0 原文

扮仙女 2024-12-22 05:44:48

让我承认这一点，

以下代码是使用 C# 而不是 C 编写的。您可以将我想要表达的想法联系起来。大部分内容都来自并行模式（是微软关于并行方法的草稿文档）。

为了尽可能实现最佳静态分区，您需要能够提前准确预测所有迭代将花费多长时间。这几乎不可行，因此需要更加动态的分区，以便系统能够快速适应不断变化的工作负载。我们可以通过转移到分区权衡范围的另一端来解决这个问题，并尽可能多地进行负载平衡。

为此，我们可以让线程竞争迭代，而不是向每个线程推送一组给定的索引来处理。我们使用要处理的剩余迭代池，该池最初开始填充所有迭代。在处理完所有迭代之前，每个线程都会进入迭代池，删除迭代值，对其进行处理，然后重复。通过这种方式，我们可以以贪婪的方式实现负载平衡的最佳水平的近似值（只有先验地了解每次迭代需要多长时间才能实现真正的最佳水平）。如果一个线程在处理特定的长迭代时陷入困境，其他线程将通过同时处理池中的工作来进行补偿。当然，即使使用这种方案，您仍然会发现自己的分区远非最佳（如果一个线程碰巧遇到比其他线程大得多的几项工作，则可能会发生这种情况），但不知道一个线程需要多少处理时间。鉴于需要完成的工作，几乎没有什么可以做的了。

下面是一个将负载平衡发挥到极致的示例实现。迭代值池被维护为表示下一个可用迭代的单个整数，并且参与处理的线程通过原子地递增该整数来“删除项目”：

public static void MyParallelFor( 
int inclusiveLowerBound, int exclusiveUpperBound, Action<int> body) 
{ 
// Get the number of processors, initialize the number of remaining 
// threads, and set the starting point for the iteration. 
int numProcs = Environment.ProcessorCount; 
int remainingWorkItems = numProcs; 
int nextIteration = inclusiveLowerBound; 
using (ManualResetEvent mre = new ManualResetEvent(false)) 
{ 
// Create each of the work items. 
for (int p = 0; p < numProcs; p++) 
{ 
ThreadPool.QueueUserWorkItem(delegate 
{ 
int index; 
while ((index = Interlocked.Increment( 
ref nextIteration) - 1) < exclusiveUpperBound) 
{ 
body(index); 
} 
if (Interlocked.Decrement(ref remainingWorkItems) == 0) 
mre.Set(); 
}); 
} 
// Wait for all threads to complete. 
mre.WaitOne(); 
} 
}

Let me acknowledge this,

Following code has been written using C# and not C. You can associate the idea what I am trying to articulate. And most of the content are from a Parallel Pattern (was a draft document by Microsoft on parallel approach)

To do the best static partitioning possible, you need to be able to accurately predict ahead of time how long all the iterations will take. That’s rarely feasible, resulting in a need for a more dynamic partitioning, where the system can adapt to changing workloads quickly. We can address this by shifting to the other end of the partitioning tradeoffs spectrum, with as much load-balancing as possible.

To do that, rather than pushing to each of the threads a given set of indices to process, we can have the threads compete for iterations. We employ a pool of the remaining iterations to be processed, which initially starts filled with all iterations. Until all of the iterations have been processed, each thread goes to the iteration pool, removes an iteration value, processes it, and then repeats. In this manner, we can achieve in a greedy fashion an approximation for the optimal level of load-balancing possible (the true optimum could only be achieved with a priori knowledge of exactly how long each iteration would take). If a thread gets stuck processing a particular long iteration, the other threads will compensate by processing work from the pool in the meantime. Of course, even with this scheme you can still find yourself with a far from optimal partitioning (which could occur if one thread happened to get stuck with several pieces of work significantly larger than the rest), but without knowledge of how much processing time a given piece of work will require, there’s little more that can be done.

Here’s an example implementation that takes load-balancing to this extreme. The pool of iteration values is maintained as a single integer representing the next iteration available, and the threads involved in the processing “remove items” by atomically incrementing this integer:

public static void MyParallelFor( 
int inclusiveLowerBound, int exclusiveUpperBound, Action<int> body) 
{ 
// Get the number of processors, initialize the number of remaining 
// threads, and set the starting point for the iteration. 
int numProcs = Environment.ProcessorCount; 
int remainingWorkItems = numProcs; 
int nextIteration = inclusiveLowerBound; 
using (ManualResetEvent mre = new ManualResetEvent(false)) 
{ 
// Create each of the work items. 
for (int p = 0; p < numProcs; p++) 
{ 
ThreadPool.QueueUserWorkItem(delegate 
{ 
int index; 
while ((index = Interlocked.Increment( 
ref nextIteration) - 1) < exclusiveUpperBound) 
{ 
body(index); 
} 
if (Interlocked.Decrement(ref remainingWorkItems) == 0) 
mre.Set(); 
}); 
} 
// Wait for all threads to complete. 
mre.WaitOne(); 
} 
}

回复收藏 0 原文

~没有更多了~