A multi-part question about multi-threading, locks and multi-core processors (multi^3)

Posted 2024-09-03 17:31:51

I have a program with two methods. The first method takes two arrays as parameters, and performs an operation in which values from one array are conditionally written into the other, like so:

void Blend(int[] dest, int[] src, int offset)
{
    for (int i = 0; i < src.Length; i++)
    {
        int rdr = dest[i + offset];
        dest[i + offset] = src[i] > rdr? src[i] : rdr; 
    }
}

The second method creates two separate sets of int arrays and iterates through them such that each array of one set is Blended with each array from the other set, like so:

void CrossBlend()
{
    // C# cannot size both dimensions of a jagged array in one expression,
    // so each inner array is allocated separately.
    int[][] set1 = new int[150][];
    for (int i = 0; i < set1.Length; i++) set1[i] = new int[75000];
    int[][] set2 = new int[25][];
    for (int i = 0; i < set2.Length; i++) set2[i] = new int[10000];
    for (int i1 = 0; i1 < set1.Length; i1++)
    {
        for (int i2 = 0; i2 < set2.Length; i2++)
        {
            Blend(set1[i1], set2[i2], 0); // or any offset, doesn't matter
        }
    }
}

First question: Since this approach is an obvious candidate for parallelization, is it intrinsically thread-safe? It seems not, since I can conceive of a scenario (unlikely, I think) where one thread's changes are lost because of a different thread's near-simultaneous operation.

If not, would this:

void Blend(int[] dest, int[] src, int offset)
{
    lock (dest)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr? src[i] : rdr; 
        }
    }
}

be an effective fix?

Second question: If so, what would be the likely performance cost of using locks like this? I assume that with something like this, if a thread attempts to lock a destination array that is currently locked by another thread, the first thread would block until the lock was released instead of continuing to process something.

Also, how much time does it actually take to acquire a lock? Nanosecond scale, or worse than that? Would this be a major issue in something like this?

Third question: How would I best approach this problem in a multi-threaded way that takes advantage of multi-core processors (this is based on the potentially wrong assumption that a multi-threaded solution would not speed up this operation on a single-core processor)? I'm guessing that I would want one thread running per core, but I don't know if that's true.

Comments (1)

拿命拼未来 2024-09-10 17:31:51

The potential contention in CrossBlend is on set1, the destination of the blend. Rather than using a lock, which is going to be expensive relative to the amount of work you are doing, arrange for each thread to work on its own destination. That is, a given destination (the array at some index in set1) is owned by a given task. This is possible because the outcome is independent of the order in which CrossBlend processes the arrays.

Each task should then run just the inner loop of CrossBlend, parameterized with the index (or range of indices) of the destination array in set1 that it owns.
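A minimal sketch of that ownership idea, assuming the Task Parallel Library's Parallel.For is available. The class name OwnedDestinationSketch and the method name CrossBlendParallel are hypothetical and do not come from the question; Blend is copied from it unchanged in behavior.

```csharp
using System.Threading.Tasks;

static class OwnedDestinationSketch
{
    // Same element-wise maximum as the Blend method in the question.
    public static void Blend(int[] dest, int[] src, int offset)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }

    // Hypothetical parallel variant: each index i1 of set1 is handled by
    // exactly one loop iteration, so no two threads write to the same
    // destination array and no lock is taken. set2 is only read.
    public static void CrossBlendParallel(int[][] set1, int[][] set2)
    {
        Parallel.For(0, set1.Length, i1 =>
        {
            for (int i2 = 0; i2 < set2.Length; i2++)
            {
                Blend(set1[i1], set2[i2], 0);
            }
        });
    }
}
```

The arrays in set2 are shared between iterations, but they are never written, so concurrent reads of them should be safe.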

You can also parallelize the Blend method itself, since each index is computed independently of the others, so there is no contention there. But on today's machines, with fewer than 40 cores, you will get sufficient parallelism just by threading the CrossBlend method.
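For completeness, here is a hedged sketch of what parallelizing inside a single Blend call could look like, using Partitioner.Create to split the index range into contiguous chunks. The names ParallelBlendSketch and BlendParallel are hypothetical. Note the assumption: this only removes contention within one call; two concurrent calls on the same dest would still race.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ParallelBlendSketch
{
    // Hypothetical parallel version of Blend for a single (dest, src) pair.
    public static void BlendParallel(int[] dest, int[] src, int offset)
    {
        // Partitioner.Create rejects an empty range, so return early.
        if (src.Length == 0) return;

        // Split [0, src.Length) into contiguous chunks. Each chunk touches a
        // disjoint slice of dest, so the chunks cannot interfere.
        var ranges = Partitioner.Create(0, src.Length);
        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                int rdr = dest[i + offset];
                dest[i + offset] = src[i] > rdr ? src[i] : rdr;
            }
        });
    }
}
```

With source arrays of only 10000 elements, the scheduling overhead may well outweigh the gain, which is in line with the answer's advice to thread CrossBlend instead.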

To run effectively on multi-core, you can either:

  1. For N cores, divide the problem into N parts. Given that set1 is reasonably large compared to the number of cores, you could just divide set1 into N ranges, and pass each range of indices to one of N threads running the inner CrossBlend loop. That will give you fairly good parallelism, but it's not optimal. (Some threads will finish sooner and end up with no work to do.)
  2. A more involved scheme is to make each iteration of the CrossBlend inner loop a separate task. Have N queues (for N cores), and distribute the tasks amongst the queues. Start N threads, with each thread reading its tasks from a queue. If a thread's queue becomes empty, it takes a task from some other thread's queue.

The second approach is best suited to irregularly sized tasks, or to a system that is also doing other work. In that case some cores will be time-slicing with other processes, so you cannot expect equal amounts of work to complete in roughly the same time on different cores.
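A rough sketch of the queue-per-core scheme, under assumptions that are not in the answer. A task here is one destination index in set1 (a full pass of the inner loop), chosen so that each destination is still owned by a single task; splitting further into one task per Blend call would let two tasks write to the same destination. The names WorkStealingSketch, CrossBlendStealing and TryTake are hypothetical, and workerCount is assumed to be at least 1.

```csharp
using System.Collections.Concurrent;
using System.Threading;

static class WorkStealingSketch
{
    static void Blend(int[] dest, int[] src, int offset)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }

    public static void CrossBlendStealing(int[][] set1, int[][] set2, int workerCount)
    {
        // One queue per worker, filled round-robin before any thread starts.
        var queues = new ConcurrentQueue<int>[workerCount];
        for (int w = 0; w < workerCount; w++) queues[w] = new ConcurrentQueue<int>();
        for (int i1 = 0; i1 < set1.Length; i1++) queues[i1 % workerCount].Enqueue(i1);

        var threads = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            int self = w; // per-iteration copy for the closure
            threads[w] = new Thread(() =>
            {
                // No tasks are added after startup, so a worker can stop
                // as soon as every queue is empty.
                while (TryTake(queues, self, out int i1))
                {
                    for (int i2 = 0; i2 < set2.Length; i2++)
                    {
                        Blend(set1[i1], set2[i2], 0);
                    }
                }
            });
            threads[w].Start();
        }
        foreach (var t in threads) t.Join();
    }

    // Try the worker's own queue first, then steal from the others in turn.
    static bool TryTake(ConcurrentQueue<int>[] queues, int self, out int task)
    {
        for (int k = 0; k < queues.Length; k++)
        {
            if (queues[(self + k) % queues.Length].TryDequeue(out task)) return true;
        }
        task = -1;
        return false;
    }
}
```

In practice a library scheduler would usually be preferable to hand-written queues; this version only illustrates the stealing step described in the list.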

The first approach is much simpler to code, and will give you a good level of parallelism.
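As an illustration of the first approach, the sketch below splits set1 into contiguous index ranges and gives each range to its own thread. It is written under assumptions: RangePartitionSketch and CrossBlendByRanges are hypothetical names, and workerCount is a parameter the caller chooses (Environment.ProcessorCount is one plausible value).

```csharp
using System.Threading;

static class RangePartitionSketch
{
    static void Blend(int[] dest, int[] src, int offset)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }

    public static void CrossBlendByRanges(int[][] set1, int[][] set2, int workerCount)
    {
        var threads = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            // Contiguous slice [start, end) of set1 owned by worker w.
            // Declared inside the loop so each closure gets its own copy.
            int start = (int)((long)set1.Length * w / workerCount);
            int end = (int)((long)set1.Length * (w + 1) / workerCount);
            threads[w] = new Thread(() =>
            {
                for (int i1 = start; i1 < end; i1++)
                {
                    for (int i2 = 0; i2 < set2.Length; i2++)
                    {
                        Blend(set1[i1], set2[i2], 0);
                    }
                }
            });
            threads[w].Start();
        }

        // Wait for every worker before returning the blended arrays.
        foreach (var t in threads) t.Join();
    }
}
```

A call such as RangePartitionSketch.CrossBlendByRanges(set1, set2, Environment.ProcessorCount) would start one thread per logical core, which matches the asker's guess of one thread per core.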
