Implementing SIMD in C++

Posted on 2024-08-30 23:32:09

I'm working on a bit of code and trying to optimize it as much as possible; basically, I need it to run under a certain time limit.

The following makes the call...

static affinity_partitioner ap;
parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);

... and the following is what is executed.

void operator()(const blocked_range<size_t> &r) const {

    int temp;
    int i;
    int j;
    size_t k;
    size_t begin = r.begin();
    size_t end = r.end();

    for(k = begin; k != end; ++k) { // for each trainee
        temp = 0;
        for(i = 0; i < N; ++i) { // for each sample
            int trr = trRating[k][i];
            int ei = E[i];              
            for(j = 0; j < ei; ++j) { // for each expert
                temp += delta(i, trr, exRating[j][i]);
            }
        }           
        myscore[k] = temp;
    }
}
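
For anyone who wants to time or check this loop nest outside of TBB, here is a minimal serial sketch of it. The container types, the function name `score_range`, and the body of `delta` are all assumptions made for illustration; the question shows none of them. Loop counters are declared where they are used, and with optimization enabled a compiler will typically keep such locals in registers without further help.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in: the real delta() is not shown in the question.
static inline int delta(int /*sample*/, int trainee, int expert) {
    return std::abs(trainee - expert);
}

// Serial version of the loop nest for trainees in [begin, end).
//   trRating[k][i]: rating given by trainee k to sample i
//   exRating[j][i]: rating given by expert j to sample i
//   E[i]:           number of experts who rated sample i
// Returns one score per trainee in the range (assumed layout, see above).
std::vector<int> score_range(const std::vector<std::vector<int> >& trRating,
                             const std::vector<std::vector<int> >& exRating,
                             const std::vector<int>& E,
                             std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= trRating.size());
    std::vector<int> myscore(end - begin, 0);
    const std::size_t N = E.size();
    for (std::size_t k = begin; k != end; ++k) {          // for each trainee
        const std::vector<int>& row = trRating[k];
        int temp = 0;
        for (std::size_t i = 0; i < N; ++i) {             // for each sample
            const int trr = row[i];
            const int ei = E[i];
            for (int j = 0; j < ei; ++j) {                // for each expert
                temp += delta(static_cast<int>(i), trr, exRating[j][i]);
            }
        }
        myscore[k - begin] = temp;
    }
    return myscore;
}
```

Calling `score_range(trRating, exRating, E, 0, T)` computes the same per-trainee sums as the TBB body would over the whole range, which gives a baseline to compare any parallel or vectorized version against.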

I'm using Intel's TBB to parallelize this, but I've also been reading about SIMD, SSE2, and things of that nature. So my question is: how do I store the variables (i, j, k) in registers so that the CPU can access them faster? I think the answer has to do with SSE2 or some variation of it, but I have no idea how to do that. Any ideas?
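
To make concrete what SSE2 would mean for the innermost loop, here is a rough sketch that processes four experts per step. It rests on two assumptions that are not in the code above: `delta` is replaced by a hypothetical absolute difference (`delta_abs`), and the expert ratings for one sample sit contiguously in memory. With `exRating[j][i]` indexed expert-first, they probably do not, so a transposed copy would likely be needed. The function names are made up for illustration, and the code falls back to a plain loop when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical stand-in for delta(): absolute difference of two ratings.
static inline int delta_abs(int a, int b) { return std::abs(a - b); }

// Scalar reference: sum of delta_abs(trr, ex[j]) for j in [0, n).
int sum_delta_scalar(int trr, const int* ex, std::size_t n) {
    int total = 0;
    for (std::size_t j = 0; j < n; ++j) total += delta_abs(trr, ex[j]);
    return total;
}

// Same sum, four 32-bit ratings per step when SSE2 is available.
// Assumes ex[0..n) is contiguous in memory.
int sum_delta_sse2(int trr, const int* ex, std::size_t n) {
    assert(ex || n == 0);
    std::size_t j = 0;
    int total = 0;
#if defined(__SSE2__)
    const __m128i vtrr = _mm_set1_epi32(trr);   // trr copied into all 4 lanes
    __m128i acc = _mm_setzero_si128();          // 4 partial sums
    for (; j + 4 <= n; j += 4) {
        // Unaligned load of four consecutive expert ratings.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ex + j));
        __m128i d = _mm_sub_epi32(vtrr, v);
        // SSE2 has no 32-bit abs, so build it from the sign mask:
        // sign is all ones where d < 0, and (d ^ sign) - sign == |d|.
        __m128i sign = _mm_srai_epi32(d, 31);
        __m128i absd = _mm_sub_epi32(_mm_xor_si128(d, sign), sign);
        acc = _mm_add_epi32(acc, absd);
    }
    // Add the four partial sums together.
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; j < n; ++j) total += delta_abs(trr, ex[j]);  // leftover elements
    return total;
}
```

Both functions should return the same value for the same input, so the scalar one can serve as a check on the vectorized one. Whether this is faster than what the compiler generates on its own depends on the real `delta` and on the data layout.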

Edit: This will run on a Linux box, using Intel's compiler, I believe. If it helps, I have to run the following commands before I do anything to make sure the compiler works...

source /opt/intel/Compiler/11.1/064/bin/intel64/iccvars_intel64.csh
source /opt/intel/tbb/2.2/bin/intel64/tbbvars.csh

... and then to compile I do:

icc -ltbb test.cxx -o test

If there's no easy way to implement SSE2, any advice on how to further optimize the code?

Thanks,
Hristo
