C++快速将 2 个数组加在一起

发布于 2024-09-03 21:38:41 字数 445 浏览 2 评论 0原文

给定数组:

int canvas[10][10];
int addon[10][10];

所有值的范围为 0 - 100,在 C++ 中添加这两个数组的最快方法是什么,以便画布中的每个单元格等于自身加上插件中相应的单元格值?

IE,我想实现类似的目标:

canvas += another;

因此,如果 canvas[0][0] =3 且 addon[0][0] = 2 那么 canvas[0][0] = 5

速度在这里至关重要,因为我正在编写一个非常简单的程序暴力破解一个背包式的问题,会有上千万种组合。

作为一个额外的小问题(如果您能提供帮助,谢谢!)检查画布中的任何值是否超过 100 的最快方法是什么?循环很慢!

Given the arrays:

int canvas[10][10];
int addon[10][10];

Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?

IE, I want to achieve something like:

canvas += another;

So if canvas[0][0] =3 and addon[0][0] = 2 then canvas[0][0] = 5

Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.

And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

苏大泽ㄣ 2024-09-10 21:38:41

下面是一个 SSE4 实现,应该在 Nehalem (Core i7) 上表现良好:

#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>

static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);   // SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE4 *
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 *
    return cmp == 0;
}

使用 gcc -msse4.1 ... 或适合您的特定开发环境的等效工具进行编译。

对于没有 SSE4 的旧版 CPU(并且具有更昂贵的未对齐加载/存储),您需要 (a) 使用 SSE2/SSE3 内在函数的合适组合来替换 SSE4 操作(标有 *上面),理想情况下 (b) 确保您的数据是 16 字节对齐并使用对齐的加载/存储 (_mm_load_si128/_mm_store_si128) 代替 _mm_loadu_si128代码>/<代码>_mm_storeu_si128。

Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):

#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>

static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);   // SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE4 *
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 *
    return cmp == 0;
}

Compile with gcc -msse4.1 ... or equivalent for your particular development environment.

For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.

深海不蓝 2024-09-10 21:38:41

没有什么比 C++ 中的循环更快的了。您需要使用一些特定于平台的向量指令。也就是说,您需要深入到汇编语言级别。但是,有一些 C++ 库尝试为您执行此操作,因此您可以进行高级编写,并让库负责执行低级操作 SIMD 工作适合您的编译器针对的任何架构。

MacSTL 是一个您可能想要查看的库。它最初是 Macintosh 特定的库,但现在是跨平台的。请参阅他们的主页以获取更多信息。

You can't do anything faster than loops in just C++. You would need to use some platform specific vector instructions. That is, you would need to go down to the assembly language level. However, there are some C++ libraries that try to do this for you, so you can write at a high level and have the library take care of doing the low level SIMD work that is appropriate for whatever architecture you are targetting with your compiler.

MacSTL is a library that you might want to look at. It was originally a Macintosh specific library, but it is cross platform now. See their home page for more info.

自由如风 2024-09-10 21:38:41

在标准 C 或 C++ 中,您要做的最好的事情就是将其重新转换为包含 100 个数字的一​​维数组,并将它们添加到循环中。 (单下标将比双下标使用更少的处理,除非编译器可以对其进行优化。如果有的话,您要知道有多少影响的唯一方法就是进行测试。)

您可以当然,创建一个类,其中添加的是一个简单的 C++ 指令 (canvas += addon;),但这不会加快任何速度。所发生的只是简单的 C++ 指令将扩展到上面的循环中。

您需要进入较低级别的处理才能加快速度。许多现代 CPU 上都有附加指令来执行此类处理,您也许可以使用。您也许可以使用 Cuda 之类的东西在 GPU 上运行类似的东西。您可以尝试使操作并行并在多个核心上运行,但在如此小的实例上,您必须了解缓存在 CPU 上的工作原理。

替代方案是改进您的算法(对于背包型问题,您可以使用动态规划< /a> 以某种方式 - 如果没有您提供更多信息,我们无法告诉您),或接受表演。 10 x 10 数组上的数千万次运算变成了数以千亿计的运算,而且这并不像以前那样令人生畏。当然,我不知道你的使用场景或者性能要求。

The best you're going to do in standard C or C++ is to recast that as a one-dimensional array of 100 numbers and add them in a loop. (Single subscripts will use a bit less processing than double ones, unless the compiler can optimize it out. The only way you're going to know how much of an effect there is, if there is one, is to test.)

You could certainly create a class where the addition would be one simple C++ instruction (canvas += addon;), but that wouldn't speed anything up. All that would happen is that the simple C++ instruction would expand into the loop above.

You would need to get into lower-level processing in order to speed that up. There are additional instructions on many modern CPUs to do such processing that you might be able to use. You might be able to run something like this on a GPU using something like Cuda. You could try making the operation parallel and running on several cores, but on such a small instance you'll have to know how caching works on your CPU.

The alternatives are to improve your algorithm (on a knapsack-type problem, you might be able to use dynamic programming in some way - without more information from you, we can't tell you), or to accept the performance. Tens of millions of operations on a 10 by 10 array turn into hundreds of billions of operations on numbers, and that's not as intimidating as it used to be. Of course, I don't know your usage scenario or performance requirements.

咿呀咿呀哟 2024-09-10 21:38:41

两个部分:首先,将二维数组 [10][10] 视为单个数组 [100]。 C++ 的布局规则应该允许这样做。其次,检查您的编译器是否有实现某种形式 SIMD 指令 的内部函数,例如 Intel 的 SSE。例如Microsoft 提供了一套。我相信 SSE 有一些检查最大值的说明,如果需要的话甚至可以限制到最大值。

Two parts: first, consider your two-dimensional array [10][10] as a single array [100]. The layout rules of C++ should allow this. Second, check your compiler for intrinsic functions implementing some form of SIMD instructions, such as Intel's SSE. For example Microsoft supplies a set. I believe SSE has some instructions for checking against a maximum value, and even clamping to the maximum if you want.

我纯我任性 2024-09-10 21:38:41

这是一个替代方案。

如果您 100% 确定所有值都在 0 到 100 之间,则可以将类型从 int 更改为 uint8_t。然后,您可以使用 uint32_t 一次将 4 个元素添加在一起,而不必担心溢出。

那是......

uint8_t  array1[10][10];
uint8_t  array2[10][10];
uint8_t  dest[10][10];
uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];

int  i;

for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}

它可能不是最优雅的,但它可以帮助您避免使用特定于体系结构的代码。此外,如果您要这样做,我强烈建议您评论您正在做什么以及为什么。

Here is an alternative.

If you are 100% certain that all your values are between 0 and 100, you could change your type from an int to a uint8_t. Then, you could add 4 elements together at once of them together using uint32_t without worrying about overflow.

That is ...

uint8_t  array1[10][10];
uint8_t  array2[10][10];
uint8_t  dest[10][10];
uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];

int  i;

for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}

It may not be the most elegant, but it could help keep you from going to architecture specific code. Additionally, if you were to do this, I would strongly recommend you comment what you are doing and why.

戏剧牡丹亭 2024-09-10 21:38:41

你应该检查一下 CUDA。这种问题是 CUDA 的问题。推荐 大规模并行处理器编程一书。

然而,这确实需要支持 CUDA 的硬件,并且 CUDA 需要花费一些精力才能在您的开发环境中进行设置,因此这取决于这到底有多重要!

祝你好运!

You should check out CUDA. This kind of problem is right up CUDA's street. Recommend the Programming Massively Parallel Processors book.

However, this does require CUDA capable hardware, and CUDA takes a bit of effort to get setup in your development environment, so it would depend how important this really is!

Good luck!

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文