C++快速将 2 个数组加在一起

发布于 2024-09-03 21:38:41 字数 445 浏览 16 评论 0原文

给定数组：

int canvas[10][10];
int addon[10][10];

所有值的范围为 0 - 100，在 C++ 中添加这两个数组的最快方法是什么，以便画布中的每个单元格等于自身加上插件中相应的单元格值？

IE，我想实现类似的目标：

canvas += another;

因此，如果 canvas[0][0] =3 且 addon[0][0] = 2 那么 canvas[0][0] = 5

速度在这里至关重要，因为我正在编写一个非常简单的程序暴力破解一个背包式的问题，会有上千万种组合。

作为一个额外的小问题（如果您能提供帮助，谢谢！）检查画布中的任何值是否超过 100 的最快方法是什么？循环很慢！

原文

Given the arrays:

int canvas[10][10];
int addon[10][10];

Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?

IE, I want to achieve something like:

canvas += another;

So if canvas[0][0] =3 and addon[0][0] = 2 then canvas[0][0] = 5

Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.

And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

苏大泽ㄣ 2024-09-10 21:38:41

下面是一个 SSE4 实现，应该在 Nehalem (Core i7) 上表现良好：

#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>

static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);   // SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE4 *
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 *
    return cmp == 0;
}

使用 gcc -msse4.1 ... 或适合您的特定开发环境的等效工具进行编译。

对于没有 SSE4 的旧版 CPU（并且具有更昂贵的未对齐加载/存储），您需要 (a) 使用 SSE2/SSE3 内在函数的合适组合来替换 SSE4 操作（标有 *上面），理想情况下 (b) 确保您的数据是 16 字节对齐并使用对齐的加载/存储 (_mm_load_si128/_mm_store_si128) 代替 _mm_loadu_si128代码>/<代码>_mm_storeu_si128。

Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):

#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>

static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);   // SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE4 *
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 *
    return cmp == 0;
}

Compile with gcc -msse4.1 ... or equivalent for your particular development environment.

For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.

回复收藏 0 原文

深海不蓝 2024-09-10 21:38:41

没有什么比 C++ 中的循环更快的了。您需要使用一些特定于平台的向量指令。也就是说，您需要深入到汇编语言级别。但是，有一些 C++ 库尝试为您执行此操作，因此您可以进行高级编写，并让库负责执行低级操作 SIMD 工作适合您的编译器针对的任何架构。

MacSTL 是一个您可能想要查看的库。它最初是 Macintosh 特定的库，但现在是跨平台的。请参阅他们的主页以获取更多信息。

回复收藏 0 原文

自由如风 2024-09-10 21:38:41

在标准 C 或 C++ 中，您要做的最好的事情就是将其重新转换为包含 100 个数字的一维数组，并将它们添加到循环中。（单下标将比双下标使用更少的处理，除非编译器可以对其进行优化。如果有的话，您要知道有多少影响的唯一方法就是进行测试。）

您可以当然，创建一个类，其中添加的是一个简单的 C++ 指令 (canvas += addon;)，但这不会加快任何速度。所发生的只是简单的 C++ 指令将扩展到上面的循环中。

您需要进入较低级别的处理才能加快速度。许多现代 CPU 上都有附加指令来执行此类处理，您也许可以使用。您也许可以使用 Cuda 之类的东西在 GPU 上运行类似的东西。您可以尝试使操作并行并在多个核心上运行，但在如此小的实例上，您必须了解缓存在 CPU 上的工作原理。

替代方案是改进您的算法（对于背包型问题，您可以使用动态规划< /a> 以某种方式 - 如果没有您提供更多信息，我们无法告诉您），或接受表演。 10 x 10 数组上的数千万次运算变成了数以千亿计的运算，而且这并不像以前那样令人生畏。当然，我不知道你的使用场景或者性能要求。

回复收藏 0 原文

咿呀咿呀哟 2024-09-10 21:38:41

两个部分：首先，将二维数组 [10][10] 视为单个数组 [100]。 C++ 的布局规则应该允许这样做。其次，检查您的编译器是否有实现某种形式 SIMD 指令的内部函数，例如 Intel 的 SSE。例如Microsoft 提供了一套。我相信 SSE 有一些检查最大值的说明，如果需要的话甚至可以限制到最大值。

回复收藏 0 原文

我纯我任性 2024-09-10 21:38:41

这是一个替代方案。

如果您 100% 确定所有值都在 0 到 100 之间，则可以将类型从 int 更改为 uint8_t。然后，您可以使用 uint32_t 一次将 4 个元素添加在一起，而不必担心溢出。

那是......

uint8_t  array1[10][10];
uint8_t  array2[10][10];
uint8_t  dest[10][10];
uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];

int  i;

for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}

它可能不是最优雅的，但它可以帮助您避免使用特定于体系结构的代码。此外，如果您要这样做，我强烈建议您评论您正在做什么以及为什么。

Here is an alternative.

If you are 100% certain that all your values are between 0 and 100, you could change your type from an int to a uint8_t. Then, you could add 4 elements together at once of them together using uint32_t without worrying about overflow.

That is ...

uint8_t  array1[10][10];
uint8_t  array2[10][10];
uint8_t  dest[10][10];
uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];

int  i;

for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}

It may not be the most elegant, but it could help keep you from going to architecture specific code. Additionally, if you were to do this, I would strongly recommend you comment what you are doing and why.

回复收藏 0 原文