How to improve array access performance in C++ under heavy access

Posted on 2025-02-11 06:51:24


I'm working on a function which will be called about 240,000,000 times. In that function, I access three vectors just once each, like:

aj = a_[j];
bj = b_[j];
cj = c_[j];

All the vectors are defined in the same class and have 999 elements of type double. It takes about 60 s to finish the job.
But if I change the vector accesses to three double variables, the time drops to about 10 s, like:

aj = fa;
bj = fb;
cj = fc;

If I change the vectors to plain arrays, it helps less: about 50 s.
Why is the time gap so large? I thought array access only involves an index calculation.
Any ideas about that?

Add some code and data:

1. To compare, consider 4 cases: the original vector data; a fake array filled with almost the real data (I also checked a fake vector; it behaves almost the same as the fake array); variables as class members; local variables.
I added them to the class (initialized in the constructor) and to the function definition.
class definition:

        class CoefficientHolder {
      public:
        explicit CoefficientHolder(Size n)
        : n_(n), primitiveConst_(n-1), a_(n-1), b_(n-1), c_(n-1),
          monotonicityAdjustments_(n), fra{0.000000, 0.000000, ...},
 frb{0.000000, -0.001426, ...}, frc {0.000000, 0.000063, 0.000002, ...}
          {
            fa = 0.00134;
            fb = 0.00001;
            fc = 0.00002;
            fake_a_={0.000000, 0.000000, -0.007229, ...};
            fake_b_={0.000000, -0.001426, 0.000026, ...};
            fake_c_={0.000000, 0.000063, 0.000002, ...};
          }

        std::vector<Real> a_, b_, c_;   // original data; Real is a typedef of double
        mutable Real fra[999], frb[999], frc[999]; // fake array
        mutable Real fa, fb, fc;        // fake variables
    };

function definition:

            Real value(Real x) const {
            static unsigned long long i = 0;
            GET_TIME_NS(start); // Get current time in nanosecond
            Size j = this->locate(x);
            Real dx_ = x-this->xBegin_[j];
            // Local faked variables
            //Real lfa = 0.01+0.002;
            //Real lfb = 0.00001 + 0.00000001;
            //Real lfc = 0.000000001 + 0.0000000000001;
            /*Check local variables performance*/
            //Real ret = this->yBegin_[j] + dx_*(lfa + dx_*(lfb + dx_*lfc));
            /*Check variables as class member performance*/
            //Real ret = this->yBegin_[j] + dx_*(fa + dx_*(fb + dx_*fc));
            /*Check faked array performance*/
            //Real ret = this->yBegin_[j] + dx_*(fra[j] + dx_*(frb[j] + dx_*frc[j]));
            /*The real data performance*/
            Real ret = this->yBegin_[j] + dx_*(a_[j] + dx_*(b_[j] + dx_*c_[j]));
            CAL_NS_GAP(start, gap); // calculate time gap
            ++i;
            CALogger* g_logger = CALogger::GetInstance();
            static unsigned long long time_sum = 0;
            if(i >= 3000 && i < 53000)
            {
                g_logger->writeLog(std::to_string(i) + ":"  + std::to_string(gap));
                time_sum += gap;
            }
            if (i == 53000)
                g_logger->writeLog("###mean for 50000:"  + std::to_string(time_sum/50000));

            static unsigned long long time_sum1 = 0;
            if(i >= 50000000 && i < 50050000)
            {   
                g_logger->writeLog(std::to_string(i) + ":"  + std::to_string(gap));
                time_sum1 += gap;
            }
            if (i == 50050000)
                g_logger->writeLog("###mean for 50000:"  + std::to_string(time_sum1/50000));

(analogous blocks accumulate time_sum2/time_sum3/time_sum4 for the remaining sample windows)

I counted the nanoseconds consumed and took 5 × 50000 samples in each case. It looks like local variables perform nearly the same as class member variables, but much better than the fake array; the fake array performs nearly the same as the original data.

Test results (each case has 5 sets, each set has 50000 samples; only the mean values are shown):

original data(303.8):
mean for 50000:341
mean for 50000:306
mean for 50000:294
mean for 50000:295
mean for 50000:283

faked array(246.8):
mean for 50000:278
mean for 50000:246
mean for 50000:243
mean for 50000:234
mean for 50000:233

local faked variables(179):
mean for 50000:196
mean for 50000:176
mean for 50000:170
mean for 50000:189
mean for 50000:164

faked variables as class member(151.6):
mean for 50000:168
mean for 50000:142
mean for 50000:156
mean for 50000:147
mean for 50000:145

More info
After I changed the three vectors to one vector of structs, performance got better, but not enough:
code:

typedef struct st_factor
{
    Real a_;
    Real b_;
    Real c_;
    st_factor() : a_(0), b_(0), c_(0) {}
}STFACTOR;
typedef std::vector<STFACTOR> VFACTOR;

Test result(226.2):
mean for 50000:225
mean for 50000:225
mean for 50000:228
mean for 50000:221
mean for 50000:232

Another set of test results, for different fake vector sizes
I tried fake vectors of different sizes: 1000/500/100/50/10.
1000 and 500 are almost the same; the final mean is about 190 ns.
100 is about 100 ns.
50 is about 85 ns.
10 is about 60 ns.
code:

    namespace QuantLib {
    typedef struct st_factor
    {
        Real a_;
        Real b_;
        Real c_;
        .......
    }STFACTOR;
    typedef std::vector<STFACTOR> VFACTOR;
        class CoefficientHolder {
          public:
            explicit CoefficientHolder(Size n)
            : m_fakeVf(TSZ)
              {
               m_fakeVf = {STFACTOR(0.01, 0.0002, 0),};
              }
        };

 Real value(Real x) const {
  Size j = this->locate(x);
  j = j%TSZ;
  GET_TIME_NS(start);
  Real fa = m_fakeVf[j].a_;
  Real fb = m_fakeVf[j].b_;
  Real fc = m_fakeVf[j].c_;
  CAL_NS_GAP(start, gap);
  // calculate gap mean for 5* 50000 samples

cachegrind test result
==15954== D refs: 97,902,271,729 (72,009,552,612 rd + 25,892,719,117 wr)
==15954== D1 misses: 1,813,482,789 ( 1,752,767,108 rd + 60,715,681 wr)
==15954== LLd misses: 56,883,506 ( 49,812,399 rd + 7,071,107 wr)
==15954== D1 miss rate: 1.9% ( 2.4% + 0.2% )
==15954== LLd miss rate: 0.1% ( 0.1% + 0.0% )

PS:

My server is a virtual machine. The value function will be called about 234,000,000 times; the vector elements stay unchanged during the process; my L1 data cache is 32 KB.

Comments (2)

纵情客 2025-02-18 06:51:24


The gap may be caused by cache filling. Try to reorder the data as follows:

struct TT{
    double a_;
    double b_;
    double c_;
};
struct TT vector[999];
struct TT* p = &vector[j];

You can use p to access a_, b_ and c_ as you need.

×纯※雪 2025-02-18 06:51:24


aj = a_[j]; will likely have to come from the L1 cache, considering the array size. But aj = fa;? Chances are that this doesn't even take 1 instruction. The compiler might simply note that the two variables have the same value. Thus, in code later on that reads aj, the compiler simply reads fa instead.

Array-access address calculation on modern CPUs is close to free. x86 in particular can do an effective "load from base plus index * 8" in a single instruction.
