How to improve array access performance in C++ under heavy access
I'm working on a function which will be called about 240000000 times. In that function, I access three vectors just one time each, like:
aj = a_[j];
bj = b_[j];
cj = c_[j];
All the vectors are defined in the same class and have 999 elements, whose type is double. It takes about 60s to finish the job.
But if I change the vector accesses to three double variables, the time drops to 10s, like:
aj = fa;
bj = fb;
cj = fc;
If I change the vectors to plain arrays, it helps less, taking about 50s.
Why is the time gap so large? I thought array access only involves an index calculation.
Any ideas about that?
Some code and data added:
1. For comparison, consider 4 cases: the original vector data; a faked array filled with almost the real data (I also checked a faked vector; it behaves almost the same as the faked array); variables as class members; local variables.
I add them to the class (initializing them in the constructor) and to the function definition.
Class definition:
class CoefficientHolder {
public:
explicit CoefficientHolder(Size n)
: n_(n), primitiveConst_(n-1), a_(n-1), b_(n-1), c_(n-1),
monotonicityAdjustments_(n), fra{0.000000, 0.000000, ...},
frb{0.000000, -0.001426, ...}, frc {0.000000, 0.000063, 0.000002, ...}
{
fa = 0.00134;
fb = 0.00001;
fc = 0.00002;
fake_a_={0.000000, 0.000000, -0.007229, ...};
fake_b_={0.000000, -0.001426, 0.000026, ...};
fake_c_={0.000000, 0.000063, 0.000002, ...};
}
std::vector<Real> a_, b_, c_; // original data; Real is a typedef of double
mutable std::vector<Real> fake_a_, fake_b_, fake_c_; // faked vectors
mutable Real fra[999], frb[999], frc[999]; // faked arrays
mutable Real fa, fb, fc; // faked variables
};
Function definition:
Real value(Real x) const {
static unsigned long long i = 0;
GET_TIME_NS(start); // Get current time in nanosecond
Size j = this->locate(x);
Real dx_ = x-this->xBegin_[j];
// Local faked variables
//Real lfa = 0.01+0.002;
//Real lfb = 0.00001 + 0.00000001;
//Real lfc = 0.000000001 + 0.0000000000001;
/*Check local variables performance*/
//Real ret = this->yBegin_[j] + dx_*(lfa + dx_*(lfb + dx_*lfc));
/*Check variables as class member performance*/
//Real ret = this->yBegin_[j] + dx_*(fa + dx_*(fb + dx_*fc));
/*Check faked variable performance*/
//Real ret = this->yBegin_[j] + dx_*(fra[j] + dx_*(frb[j] + dx_*frc[j]));
/*The real data performance*/
Real ret = this->yBegin_[j] + dx_*(a_[j] + dx_*(b_[j] + dx_*c_[j]));
CAL_NS_GAP(start, gap); // calculate time gap
++i;
CALogger* g_logger = CALogger::GetInstance();
static unsigned long long time_sum = 0;
if(i >= 3000 && i < 53000)
{
g_logger->writeLog(std::to_string(i) + ":" + std::to_string(gap));
time_sum += gap;
}
if (i == 53000)
g_logger->writeLog("###mean for 50000:" + std::to_string(time_sum/50000));
static unsigned long long time_sum1 = 0;
if(i >= 50000000 && i < 50050000)
{
g_logger->writeLog(std::to_string(i) + ":" + std::to_string(gap));
time_sum1 += gap;
}
if (i == 50050000)
g_logger->writeLog("###mean for 50000:" + std::to_string(time_sum1/50000));
// ... analogous blocks accumulate time_sum2/time_sum3/time_sum4 for the other cases ...
return ret;
}
I counted the nanoseconds consumed and took 5 * 50000 samples in each case. It looks like local variables work nearly the same as class member variables, but much better than the faked array; the faked array works nearly the same as the original data.
Test results (each case has 5 sets, each set has 50000 samples; only the mean value is shown):
original data(303.8):
mean for 50000:341
mean for 50000:306
mean for 50000:294
mean for 50000:295
mean for 50000:283
faked array(246.8):
mean for 50000:278
mean for 50000:246
mean for 50000:243
mean for 50000:234
mean for 50000:233
local faked variables(179):
mean for 50000:196
mean for 50000:176
mean for 50000:170
mean for 50000:189
mean for 50000:164
faked variables as class member(151.6):
mean for 50000:168
mean for 50000:142
mean for 50000:156
mean for 50000:147
mean for 50000:145
More info:
After I changed the three vectors to one vector of structs, performance got better, but not enough:
code:
typedef struct st_factor
{
Real a_;
Real b_;
Real c_;
st_factor() : a_(0), b_(0), c_(0) {}
}STFACTOR;
typedef std::vector<STFACTOR> VFACTOR;
Test results (226.2):
mean for 50000:225
mean for 50000:225
mean for 50000:228
mean for 50000:221
mean for 50000:232
Another set of test results for different fake vector sizes:
I tried fake vectors with different sizes: 1000/500/100/50/10.
1000 and 500 are almost the same; the final mean is about 190ns.
100 is about 100ns.
50 is about 85ns.
10 is about 60ns.
code:
namespace QuantLib {
typedef struct st_factor
{
Real a_;
Real b_;
Real c_;
.......
}STFACTOR;
typedef std::vector<STFACTOR> VFACTOR;
class CoefficientHolder {
public:
explicit CoefficientHolder(Size n)
: m_fakeVf(TSZ)
{
m_fakeVf = {STFACTOR(0.01, 0.0002, 0),};
}
mutable VFACTOR m_fakeVf; // faked struct vector
};
Real value(Real x) const {
Size j = this->locate(x);
j = j%TSZ;
GET_TIME_NS(start);
Real fa = m_fakeVf[j].a_;
Real fb = m_fakeVf[j].b_;
Real fc = m_fakeVf[j].c_;
CAL_NS_GAP(start, gap);
// calculate gap mean for 5 * 50000 samples, as above
}
Cachegrind test results:
==15954== D refs: 97,902,271,729 (72,009,552,612 rd + 25,892,719,117 wr)
==15954== D1 misses: 1,813,482,789 ( 1,752,767,108 rd + 60,715,681 wr)
==15954== LLd misses: 56,883,506 ( 49,812,399 rd + 7,071,107 wr)
==15954== D1 miss rate: 1.9% ( 2.4% + 0.2% )
==15954== LLd miss rate: 0.1% ( 0.1% + 0.0% )
PS:
My server is a virtual machine. The value function will be called about 234000000 times; the vector elements stay unchanged during the process; my L1 data cache is 32K.
Answers (2):
Answer 1:
The gap may be caused by cache filling. Try to re-order the vectors into one combined structure `p`; you can then use `p` to access `a_`, `b_`, and `c_` as you need.
Answer 2:
`aj = a_[j];` will likely have to come from the L1 cache, considering the array size. But `aj = fa;`? Chances are that this doesn't even take one instruction. The compiler might simply note that the two variables have the same value; thus, in later code that reads `aj`, the compiler simply reads `fa` instead. Array-access address calculation on modern CPUs is close to free; x86 in particular can do an effective "load from base plus index * 8" in a single instruction.