Variable running time of a C program
My (SIMD) implementation takes a varying amount of time even though it is run on a fixed input. The running time varies between roughly 100 million and 120 million clock cycles. The program calls a function around 600 times, and the most expensive part of that function accesses memory about 2000 times. Overall, memory traffic accounts for a large share of the work in my program.
Is the variation in running time due to memory access patterns/initial memory contents?
I used valgrind to profile my program. It shows that each memory access takes about 8 instructions. Is this normal?
The following is the piece of code (the function) that is called 600 times. mulprev[32][20] is the most frequently accessed array.
j = 15;
u3v = _mm_set_epi64x (0xF, 0xF);    /* mask: low 4 bits of each 64-bit lane */
while (j + 1)
{
    l = j << 2;                     /* bit offset of nibble j */
    for (i = 0; i < 20; i++)
    {
        val1v = _mm_load_si128 ((__m128i *) &elm1v[i]);
        uv = _mm_and_si128 (_mm_srli_epi64 (val1v, l), u3v);
        u1 = _mm_extract_epi16 (uv, 0);         /* nibble of the low lane: rows 0..15 */
        u2 = _mm_extract_epi16 (uv, 4) + 16;    /* nibble of the high lane: rows 16..31 */
        for (ival = i, ival1 = i + 1, k = 0; k < 20; k += 2, ival += 2, ival1 += 2)
        {
            temp11v = _mm_load_si128 ((__m128i *) &mulprev[u1][k]);
            temp12v = _mm_load_si128 ((__m128i *) &mulprev[u2][k]);
            val1v = _mm_load_si128 ((__m128i *) &res[ival]);
            val2v = _mm_load_si128 ((__m128i *) &res[ival1]);
            bv = _mm_xor_si128 (val1v, _mm_unpacklo_epi64 (temp11v, temp12v));
            av = _mm_xor_si128 (val2v, _mm_unpackhi_epi64 (temp11v, temp12v));
            _mm_store_si128 ((__m128i *) &res[ival], bv);
            _mm_store_si128 ((__m128i *) &res[ival1], av);
        }
    }
    if (j == 0)
        break;
    /* Shift res left by 4 bits, carrying the top nibble of each lane
       into the same lane of the next element. */
    val0v = _mm_setzero_si128 ();
    for (i = 0; i < 40; i++)
    {
        testv = _mm_load_si128 ((__m128i *) &res[i]);
        val1v = _mm_srli_epi64 (testv, 60);
        val2v = _mm_xor_si128 (val0v, _mm_slli_epi64 (testv, 4));
        _mm_store_si128 ((__m128i *) &res[i], val2v);
        val0v = val1v;
    }
    j--;
}
I want to reduce the computation time of my program. Any suggestions?
Comments (3)
You are performing almost no computation between the loads and stores, so your execution time will most likely be dominated by the cost of moving data to and from cache/memory. Even worse, your data set appears to be relatively small. Probably the only way you can optimise this further is to improve the memory access pattern (make accesses sequential where possible, ensure that cache lines are not wasted, etc.) and/or combine these operations with other code that operates on the same data set before or after this routine (so that the cost of the loads/stores is amortised somewhat).
EDIT: note that I gave a very similar answer when you asked much the same question about an apparently earlier version of this routine: "How to make the following code faster". You seem to have missed the point that your main performance problem here is memory access, not computation.
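As a rough illustration of amortising load/store cost by combining passes, here is a minimal scalar sketch. It is not the asker's routine: the function names, the `tab` array, and the use of plain `uint64_t` words instead of `__m128i` are all assumptions made for the example. The two-pass version touches every element of `res` twice, while the fused version loads and stores each element once and produces the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two passes: XOR a table into res, then shift the whole array left by
   4 bits with carry. Every element of res is loaded and stored twice. */
void xor_then_shift_two_pass(uint64_t *res, const uint64_t *tab, size_t n)
{
    assert(res != NULL && tab != NULL);

    for (size_t i = 0; i < n; i++)
        res[i] ^= tab[i];

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v = res[i];
        res[i] = (v << 4) ^ carry;  /* low 4 bits are free after the shift */
        carry = v >> 60;            /* top nibble moves to the next word */
    }
}

/* Fused: the same arithmetic, but each element of res is loaded once
   and stored once. */
void xor_then_shift_fused(uint64_t *res, const uint64_t *tab, size_t n)
{
    assert(res != NULL && tab != NULL);

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v = res[i] ^ tab[i];
        res[i] = (v << 4) ^ carry;
        carry = v >> 60;
    }
}
```

Whether a fusion like this pays off in the real routine depends on how its passes overlap, so it is worth measuring before and after.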
Computers are complicated. Background processes could easily be interfering in some way. It is hard to suggest improvements without additional information. Generally, the best optimizations are the high-level ones: choose better algorithms and minimize expensive operations. If you don't think there is much room for improvement there, don't expect large gains. You say that your memory accesses take a lot of cycles. I could suggest that you use restrict-qualified pointers where possible, but it's hard to give general advice on optimization issues. You more or less have to try things out yourself.
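One way to check whether the spread comes from outside interference rather than from the code itself is to time many runs and compare the minimum, median, and maximum. The sketch below is only an illustration: `measure`, `timing_stats`, `now_ns`, and `example_kernel` are hypothetical names, and it uses the C11 wall-clock call `timespec_get` for portability. A monotonic clock or a cycle counter could be swapped in where available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef void (*kernel_fn)(void *ctx);

typedef struct {
    uint64_t min_ns;
    uint64_t median_ns;
    uint64_t max_ns;
} timing_stats;

/* Wall-clock time in nanoseconds (C11). Not monotonic, so a clock
   adjustment during a run would corrupt that one sample. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical stand-in for the routine under test. */
void example_kernel(void *ctx)
{
    volatile uint64_t *acc = ctx;
    for (uint64_t i = 0; i < 100000; i++)
        *acc ^= i;
}

/* Runs fn(ctx) `runs` times after one unrecorded warm-up call and
   summarises the samples. Returns 0 on success, -1 if allocation fails. */
int measure(kernel_fn fn, void *ctx, size_t runs, timing_stats *out)
{
    assert(fn != NULL && out != NULL && runs > 0);

    uint64_t *samples = malloc(runs * sizeof *samples);
    if (samples == NULL)
        return -1;

    fn(ctx);  /* warm-up: brings code and data into the caches */
    for (size_t r = 0; r < runs; r++) {
        uint64_t t0 = now_ns();
        fn(ctx);
        samples[r] = now_ns() - t0;
    }

    qsort(samples, runs, sizeof *samples, cmp_u64);
    out->min_ns = samples[0];
    out->median_ns = samples[runs / 2];
    out->max_ns = samples[runs - 1];
    free(samples);
    return 0;
}
```

If the minimum is stable across invocations while the maximum jumps around, the variation is more likely to come from the environment (other processes, interrupts, frequency changes) than from the routine.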
8 cycles for a memory access is quite a long time. Another process might be having a negative impact on the CPU caches, causing a lot of cache misses in your program, or, if your memory is dynamically allocated, you might be seeing unaligned memory access penalties.
It could be anything.
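To rule alignment in or out for dynamically allocated buffers, the addresses can be checked directly. The aligned-load intrinsic `_mm_load_si128` used in the question requires a 16-byte-aligned address, so a check like the one below is cheap insurance. This is a minimal sketch assuming C11 `aligned_alloc` is available; `is_aligned` and `alloc_words_16` are hypothetical helper names, not part of the asker's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero if p is aligned to `alignment` bytes.
   `alignment` must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Allocates room for n 64-bit words on a 16-byte boundary.
   Release with free(). Overflow of n * 8 is not checked in this sketch. */
uint64_t *alloc_words_16(size_t n)
{
    size_t bytes = n * sizeof(uint64_t);
    /* Round up so the size is a multiple of the alignment. */
    size_t rounded = (bytes + 15) & ~(size_t)15;
    if (rounded == 0)
        rounded = 16;
    return aligned_alloc(16, rounded);
}
```

For example, a table the size of the question's mulprev could be obtained with `alloc_words_16(32 * 20)`, assuming its elements are 64-bit words, and verified with `is_aligned(p, 16)` before it is used with aligned loads.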