How to improve array access performance in C++ under heavy access
I'm working on a function which will be called about 240000000 times. In that function, I access three vectors just one time each, like:
aj = a_[j];
bj = b_[j];
cj = c_[j];
All the vectors are defined in the same class and have 999 elements, whose type is double. It takes about 60s to finish the job.
But if I change the vector accesses to three double variables, the time drops to 10s, like:
aj = fa;
bj = fb;
cj = fc;
If I change the vectors to plain arrays, it helps less, taking about 50s.
Why is the time gap so large? I thought array access only involves an index calculation.
Any ideas about that?
Some code and data added:
1. For comparison, consider 4 cases: the original vector data; a faked array filled with almost the real data (I also checked a faked vector; it behaves almost the same as the faked array); variables as class members; local variables.
I add them to the class (initializing them in the constructor) and to the function definition.
Class definition:
class CoefficientHolder {
public:
explicit CoefficientHolder(Size n)
: n_(n), primitiveConst_(n-1), a_(n-1), b_(n-1), c_(n-1),
monotonicityAdjustments_(n), fra{0.000000, 0.000000, ...},
frb{0.000000, -0.001426, ...}, frc {0.000000, 0.000063, 0.000002, ...}
{
fa = 0.00134;
fb = 0.00001;
fc = 0.00002;
fake_a_={0.000000, 0.000000, -0.007229, ...};
fake_b_={0.000000, -0.001426, 0.000026, ...};
fake_c_={0.000000, 0.000063, 0.000002, ...};
}
std::vector<Real> a_, b_, c_; // original data; Real is a typedef of double
mutable std::vector<Real> fake_a_, fake_b_, fake_c_; // faked vectors
mutable Real fra[999], frb[999], frc[999]; // faked arrays
mutable Real fa, fb, fc; // faked variables
};
Function definition:
Real value(Real x) const {
static unsigned long long i = 0;
GET_TIME_NS(start); // Get current time in nanosecond
Size j = this->locate(x);
Real dx_ = x-this->xBegin_[j];
// Local faked variables
//Real lfa = 0.01+0.002;
//Real lfb = 0.00001 + 0.00000001;
//Real lfc = 0.000000001 + 0.0000000000001;
/*Check local variables performance*/
//Real ret = this->yBegin_[j] + dx_*(lfa + dx_*(lfb + dx_*lfc));
/*Check variables as class member performance*/
//Real ret = this->yBegin_[j] + dx_*(fa + dx_*(fb + dx_*fc));
/*Check faked variable performance*/
//Real ret = this->yBegin_[j] + dx_*(fra[j] + dx_*(frb[j] + dx_*frc[j]));
/*The real data performance*/
Real ret = this->yBegin_[j] + dx_*(a_[j] + dx_*(b_[j] + dx_*c_[j]));
CAL_NS_GAP(start, gap); // calculate time gap
++i;
CALogger* g_logger = CALogger::GetInstance();
static unsigned long long time_sum = 0;
if(i >= 3000 && i < 53000)
{
g_logger->writeLog(std::to_string(i) + ":" + std::to_string(gap));
time_sum += gap;
}
if (i == 53000)
g_logger->writeLog("###mean for 50000:" + std::to_string(time_sum/50000));
static unsigned long long time_sum1 = 0;
if(i >= 50000000 && i < 50050000)
{
g_logger->writeLog(std::to_string(i) + ":" + std::to_string(gap));
time_sum1 += gap;
}
if (i == 50050000)
g_logger->writeLog("###mean for 50000:" + std::to_string(time_sum1/50000));
// ... analogous blocks accumulate time_sum2/time_sum3/time_sum4 for the other cases ...
return ret;
}
I counted the nanoseconds consumed and took 5 * 50000 samples in each case. It looks like local variables work nearly the same as class member variables, but much better than the faked array; the faked array works nearly the same as the original data.
Test results (each case has 5 sets, each set has 50000 samples; only the mean value is shown):
original data(303.8):
mean for 50000:341
mean for 50000:306
mean for 50000:294
mean for 50000:295
mean for 50000:283
faked array(246.8):
mean for 50000:278
mean for 50000:246
mean for 50000:243
mean for 50000:234
mean for 50000:233
local faked variables(179):
mean for 50000:196
mean for 50000:176
mean for 50000:170
mean for 50000:189
mean for 50000:164
faked variables as class member(151.6):
mean for 50000:168
mean for 50000:142
mean for 50000:156
mean for 50000:147
mean for 50000:145
More info:
After I changed the three vectors to one vector of structs, performance got better, but not enough:
code:
typedef struct st_factor
{
Real a_;
Real b_;
Real c_;
st_factor() : a_(0), b_(0), c_(0) {}
}STFACTOR;
typedef std::vector<STFACTOR> VFACTOR;
Test results (226.2):
mean for 50000:225
mean for 50000:225
mean for 50000:228
mean for 50000:221
mean for 50000:232
Another set of test results for different fake vector sizes:
I tried fake vectors with different sizes: 1000/500/100/50/10.
1000 and 500 are almost the same; the final mean is about 190ns.
100 is about 100ns.
50 is about 85ns.
10 is about 60ns.
code:
namespace QuantLib {
typedef struct st_factor
{
Real a_;
Real b_;
Real c_;
.......
}STFACTOR;
typedef std::vector<STFACTOR> VFACTOR;
class CoefficientHolder {
public:
explicit CoefficientHolder(Size n)
: m_fakeVf(TSZ)
{
m_fakeVf = {STFACTOR(0.01, 0.0002, 0),};
}
mutable VFACTOR m_fakeVf; // faked struct vector
};
Real value(Real x) const {
Size j = this->locate(x);
j = j%TSZ;
GET_TIME_NS(start);
Real fa = m_fakeVf[j].a_;
Real fb = m_fakeVf[j].b_;
Real fc = m_fakeVf[j].c_;
CAL_NS_GAP(start, gap);
// calculate gap mean for 5 * 50000 samples, as above
}
Cachegrind test results:
==15954== D refs: 97,902,271,729 (72,009,552,612 rd + 25,892,719,117 wr)
==15954== D1 misses: 1,813,482,789 ( 1,752,767,108 rd + 60,715,681 wr)
==15954== LLd misses: 56,883,506 ( 49,812,399 rd + 7,071,107 wr)
==15954== D1 miss rate: 1.9% ( 2.4% + 0.2% )
==15954== LLd miss rate: 0.1% ( 0.1% + 0.0% )
PS:
My server is a virtual machine. The value function will be called about 234000000 times; the vector elements stay unchanged during the process; my L1 data cache is 32K.
Answers (2):
Answer 1:
The gap may be caused by cache filling. Try to re-order the vectors into one combined structure `p`; you can then use `p` to access `a_`, `b_`, and `c_` as you need.
Answer 2:
`aj = a_[j];` will likely have to come from the L1 cache, considering the array size. But `aj = fa;`? Chances are that this doesn't even take one instruction. The compiler might simply note that the two variables have the same value; thus, in later code that reads `aj`, the compiler simply reads `fa` instead. Array-access address calculation on modern CPUs is close to free; x86 in particular can do an effective "load from base plus index * 8" in a single instruction.