Implementing SIMD in C++

Posted on 2024-08-30 23:32:09

I'm working on a bit of code and I'm trying to optimize it as much as possible, basically get it running under a certain time limit.

The following makes the call...

static affinity_partitioner ap;
parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);

... and the following is what is executed.

void operator()(const blocked_range<size_t> &r) const {

    int temp;
    int i;
    int j;
    size_t k;
    size_t begin = r.begin();
    size_t end = r.end();

    for(k = begin; k != end; ++k) { // for each trainee
        temp = 0;
        for(i = 0; i < N; ++i) { // for each sample
            int trr = trRating[k][i];
            int ei = E[i];              
            for(j = 0; j < ei; ++j) { // for each expert
                temp += delta(i, trr, exRating[j][i]);
            }
        }           
        myscore[k] = temp;
    }
}

I'm using Intel's TBB to parallelize this. But I've also been reading about SIMD, SSE2, and things of that nature. So my question is: how do I store the variables (i, j, k) in registers so that the CPU can access them faster? I think the answer has to do with implementing SSE2 or some variation of it, but I have no idea how to do that. Any ideas?

Edit: This will be run on a Linux box, but using Intel's compiler, I believe. If it helps, I have to run the following commands before I do anything to make sure the compiler works:

source /opt/intel/Compiler/11.1/064/bin/intel64/iccvars_intel64.csh
source /opt/intel/tbb/2.2/bin/intel64/tbbvars.csh

and then to compile I do:

icc -ltbb test.cxx -o test

If there's no easy way to implement SSE2, any advice on how to further optimize the code?

Thanks,
Hristo

Comments (4)

勿忘初心 2024-09-06 23:32:09

Your question reflects some confusion about what is going on. The i, j, k variables are almost certainly held in registers already, assuming you are compiling with optimizations on (which you should do: add "-O2" to your icc invocation).

You can use an asm block, but an easier method considering you're already using ICC is to use the SSE intrinsics. Intel's documentation for them is here - http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1019.htm

It looks like you can SIMD-ize the top-level loop, though it's going to depend greatly on what your delta function is.
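As a rough illustration of what the intrinsics route looks like, here is a minimal sketch. The question never shows delta, so this assumes a hypothetical delta that is simply the absolute difference of two ratings, and it assumes the two sets of ratings being compared sit in contiguous int arrays (the original exRating[j][i] access strides across rows, so the data would probably need to be transposed first). The names delta_abs and sum_abs_diff_sse2 are made up for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical stand-in for delta(): absolute difference of two ratings.
inline int delta_abs(int a, int b) { return a > b ? a - b : b - a; }

// Sums |a[i] - b[i]| over n elements, four 32-bit ints per SSE2 register.
int sum_abs_diff_sse2(const int* a, const int* b, std::size_t n) {
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads, so the arrays need no special alignment.
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i d = _mm_sub_epi32(va, vb);
        // SSE2 has no 32-bit abs instruction: build it from the sign mask.
        __m128i sign = _mm_srai_epi32(d, 31);             // all ones where d < 0
        __m128i absd = _mm_sub_epi32(_mm_xor_si128(d, sign), sign);
        acc = _mm_add_epi32(acc, absd);
    }
    // Horizontal sum of the four lanes.
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) total += delta_abs(a[i], b[i]);
    return total;
}
```

Whether this beats what the compiler generates at -O2 or -O3 is something to measure; if the real delta has branches or table lookups, the vector version would look quite different.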

埋葬我深情 2024-09-06 23:32:09

When you want to use assembly language within a C++ module, you can just put it inside an asm block, and continue to use your variable names from outside the block. The assembly instructions you use within the asm block will specify which register etc. is being operated on, but they will vary by platform.
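For a sense of what such a block looks like, here is a small sketch using the GCC-style extended asm syntax (which Clang and, as far as I know, ICC on Linux also accept). It assumes an x86 target and falls back to plain C++ elsewhere; the function name add_with_asm is invented for the example.

```cpp
// Returns a + b. On x86 with a GCC-compatible compiler the addition is done
// by an inline "addl" instruction; the compiler picks the actual registers.
int add_with_asm(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // AT&T syntax: addl source, destination.
    // "+r"(a): a is read and written, kept in a general-purpose register.
    // "r"(b):  b is a read-only register input.
    // "cc":    the instruction modifies the flags register.
    asm("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
#else
    a += b;  // portable fallback for other compilers and architectures
#endif
    return a;
}
```

Note that the compiler, not the programmer, assigns the registers here, and an asm block this small is unlikely to be faster than the plain C++ it replaces; it mainly shows how variables from the surrounding code are bound to operands.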

丘比特射中我 2024-09-06 23:32:09

If you're using GCC, see http://gcc.gnu.org/projects/tree-ssa/vectorization.html for how to help the compiler auto-vectorize your code, and examples.
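To give an idea of the kind of loop an auto-vectorizer tends to handle well, here is a minimal sketch: a single counted loop over contiguous arrays with a simple reduction and no function calls it cannot inline. The body assumes a hypothetical delta that is just the absolute difference of two ratings, since the real delta is not shown in the question; the function name sum_abs_diff is likewise made up.

```cpp
#include <cstddef>

// Sums |a[i] - b[i]| over n contiguous elements.
// The loop has a known trip count, unit-stride accesses, and a plain integer
// reduction, which is the shape vectorizers generally recognize.
int sum_abs_diff(const int* a, const int* b, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int d = a[i] - b[i];
        total += d < 0 ? -d : d;  // branch-free select after optimization
    }
    return total;
}
```

With GCC, building at -O3 enables the vectorizer, and -fopt-info-vec asks it to report which loops were vectorized. The inner loop in the question reads exRating[j][i] with j varying, which is a strided access; rearranging the data so the innermost index is contiguous would likely be needed before any compiler can vectorize it.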

Otherwise, you need to let us know what platform you are using.

鱼窥荷 2024-09-06 23:32:09

The compiler should be doing this for you. For example, in VC++ you can simply turn on SSE2 code generation (the /arch:SSE2 compiler option).
