C++堆内存性能提升

发布于 2024-11-02 09:03:19 字数 832 浏览 5 评论 0原文

我正在编写一个需要大量堆内存的函数。是否可以告诉编译器这些数据将在特定的 for 循环中被频繁访问，从而提高性能（通过编译选项或类似选项）？

我无法使用堆栈的原因是我需要存储的元素数量很大，如果我尝试这样做，就会出现分段错误。

现在代码正在运行，但我认为它可以更快。

更新：我正在做这样的事情

vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
  for(uint j = i+1; j < node_vec.size(); j++)
    // some computation, basic math, store the result in variable x
      if( x > threshold ) {
         vec[i].insert(j);
         vec[j].insert(i);
      }

一些细节：
- 我使用了 hash_set，几乎没有什么改进，除了 hash_set 并不在我用于模拟目的的所有机器上可用的事实
- 我尝试使用数组在堆栈上分配 vec，但是，正如我所说，如果元素数量太大，我可能会出现分段错误如果

node_vec.size() 等于 k，其中 k 的顺序几千个中，我预计 vec 比 node_vec 大 4 或 5 倍。考虑到我必须多次运行代码，在这个数量级下，代码看起来很慢。当然，我使用多线程来并行化这些调用，但我无法让函数本身运行得比我现在看到的更快。

例如，是否可以在高速缓存中分配 vec 以进行快速数据检索或类似的操作？

原文

I'm writing a function where I need a significant amount of heap memory. Is it possible to tell the compiler that those data will be accessed frequently within a specific for loop, so as to improve performance (through compile options or similar)?

The reason I cannot use the stack is that the number of elements I need to store is big, and I get segmentation fault if I try to do it.

Right now the code is working but I think it could be faster.

UPDATE:
I'm doing something like this

vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
  for(uint j = i+1; j < node_vec.size(); j++)
    // some computation, basic math, store the result in variable x
      if( x > threshold ) {
         vec[i].insert(j);
         vec[j].insert(i);
      }

some details:
- I used hash_set, little improvement, beside the fact that hash_set is not available in all machines I have for simulation purposes
- I tried to allocate vec on the stack using arrays but, as I said, I might get segmentation fault if the number of elements is too big

If node_vec.size() is, say, equal to k, where k is of the order of a few thousands, I expect vec to be 4 or 5 times bigger than node_vec. With this order of magnitude the code appears to be slow, considering the fact that I have to run it many times. Of course, I am using multithreading to parallelize these calls, but I can't get the function per se to run much faster than what I'm seeing right now.

Would it be possible, for example, to have vec allocated in the cache memory for fast data retrieval, or something similar?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小镇女孩 2024-11-09 09:03:19

我正在编写一个需要大量堆内存的函数...将在特定的 for 循环中频繁访问

这不是您可以在编译器级别真正优化的东西。我认为您担心的是您有很多内存可能“陈旧”（调出），但在特定时间点您将需要迭代所有内存，可能需要迭代几次，并且您不想要这些内存要调出到磁盘的页面。

您需要研究特定于平台的策略来提高性能。将页面保留在内存中可以通过 mlockall 或 VirtualLock 但您确实不需要这样做。不过，请确保您了解将应用程序的内存页锁定到 RAM 中的含义。您正在占用其他进程的内存。

您可能还想研究低碎片堆（但是它可能与这个问题根本不相关）和此页面描述了与for循环相关的缓存行。

后一页介绍了 CPU 在内存访问方面的工作原理（通常不必关心的细节）。

示例 1：内存访问和性能
与循环 1 相比，您预计循环 2 的运行速度要快多少？
int[] arr = new int[64 * 1024 * 1024];

// 循环1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// 循环2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
第一个循环将数组中的每个值乘以 3，第二个循环仅每 16 个值相乘。 第二个循环只完成第一个循环大约 6% 的工作，但在现代机器上，两个 for 循环花费的时间大约相同：80 毫秒和 78 毫秒分别在我的机器上。

I'm writing a function where I need a significant amount of heap memory ... will be accessed frequently within a specific for loop

This isn't something you can really optimize at a compiler level. I think your concern is that you have a lot of memory that may be "stale" (paged out) but at a particular point in time you will need to iterate over all of it, maybe several times and you don't want the memory pages to be paged out to disk.

You will need to investigate strategies that are platform specific to improve performance. Keeping the pages in memory can be achieved with mlockall or VirtualLock but you really shouldn't need to do this. Make sure you know what the implications of locking your application's memory pages into RAM is, however. You're hogging memory from other processes.

You might also want to investigate a low fragmentation heap (however it may not be relevant at all to this problem) and this page which describes cache lines with respect to for loops.

The latter page is about the nitty-gritty of how CPUs work (a detail you normally shouldn't have to be concerned with) with respect to memory access.

Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared Loop 1?
int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.

回复收藏 0 原文

再浓的妆也掩不了殇 2024-11-09 09:03:19

更新

vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
  for(uint j = i+1; j < node_vec.size(); j++)
    // some computation, basic math, store the result in variable x
      if( x > threshold ) {
         vec[i].insert(j);
         vec[j].insert(i);
      }

这仍然没有显示出太多信息，因为我们无法知道条件 x > 出现的频率。阈值将为真。如果x > Threshold 通常为 true，那么 std::set 可能是瓶颈，因为它必须为您插入的每个 uint 进行动态内存分配。

而且我们不知道“某些计算”实际上意味着/做什么/是什么。如果它做得太多，或者做得不对，那可能会成为瓶颈。

我们不知道您需要如何访问结果。

不管怎样，凭直觉：

    vector<pair<int, int> > vec1;
    vector<pair<int, int> > vec2;

    for (uint i = 0; i < node_vec.size(); i++)
    {
        for (uint j = i+1; j < node_vec.size(); j++)
        {
            // some computation, basic math, store the result in variable x
            if (x > threshold)
            {
                vec1.push_back(make_pair(i, j));
                vec2.push_back(make_pair(j, i));
            }
        }
    }

如果你能以这种形式使用结果，那么你就完成了。否则你可以做一些后处理。只是不要再次将其复制到 std::set 中（显然）。尝试坚持使用 std::vector。例如，您可以像这样在向量中构建索引：

    // ...
    vector<int> index1 = build_index(node_vec.size(), vec1);
    vector<int> index2 = build_index(node_vec.size(), vec2);
    // ...
}    

vector<int> build_index(size_t count, vector<pair<int, int> > const& vec)
{
    vector<int> index(count, -1);

    size_t i = vec.size();
    do
    {
        i--;
        assert(vec[i].first >= 0);
        assert(vec[i].first < count);
        index[vec[i].first] = i;
    }
    while (i != 0);

    return index;
}

ps.: 我几乎确定您的循环不受内存限制。但不能确定......如果你没有向我们展示的“节点”真的很大，它可能仍然很大。

原始答案：

没有简单的 I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len) 类型的解决方案。

不过你可以做一些事情。

确保编译器可以“看到”关键循环内调用的每个函数的实现。编译器能够“看到”实现所需的条件取决于编译器。不过，有一种方法可以确保：在循环之前的同一翻译单元中定义所有相关函数，并将它们声明为内联函数。
这也意味着您不应该以任何方式在这些关键循环中调用“外部”函数。我所说的“外部”函数是指系统调用、运行时库内容或在 DLL/SO 中实现的内容。也不要调用虚函数，也不要使用函数指针。或者当然不要分配或释放内存（在关键循环内）。
确保您使用最佳算法。如果算法的复杂度高于必要的程度，线性优化就没有意义。
使用尽可能小的类型。例如，如果 signed char 可以完成这项工作，则不要使用 int。我通常不会推荐这样做，但是当处理大块内存时，它可以大大提高性能。特别是在非常紧密的循环中。
如果您只是复制或填充内存，请使用 memcpy 或 memset。如果块大于大约 50 到 100 字节，则禁用这两个函数的固有版本。
确保以缓存友好的方式访问数据。最佳方式是“流式传输”——即以升序或降序地址访问存储器。您可以一次“跳转”一些字节，但不要跳得太远。最糟糕的是随机访问一大块内存。例如，如果您必须处理二维矩阵（如位图图像），其中 p[0] 到 p[1] 是“向右”的一步（x + 1），请确保内部循环递增 x 并且外部增量 y。如果您以相反的方式进行操作，性能将会大大变差。
如果你的指针是无别名的，你可以告诉编译器（如何完成取决于编译器）。如果您不知道无别名意味着什么，我建议您搜索网络和编译器的文档，因为解释超出了范围。
如果合适，请使用内部 SIMD 指令。
如果您知道不久的将来将需要哪些内存位置，请使用显式预取指令。

UPDATE

vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
  for(uint j = i+1; j < node_vec.size(); j++)
    // some computation, basic math, store the result in variable x
      if( x > threshold ) {
         vec[i].insert(j);
         vec[j].insert(i);
      }

That still doesn't show much, because we cannot know how often the condition x > threshold will be true. If x > threshold is very frequently true, then the std::set might be the bottleneck, because it has to do a dynamic memory allocation for every uint you insert.

Also we don't know what "some computation" actually means/does/is. If it does much, or does it in the wrong way that could be the bottleneck.

And we don't know how you need to access the result.

Anyway, on a hunch:

    vector<pair<int, int> > vec1;
    vector<pair<int, int> > vec2;

    for (uint i = 0; i < node_vec.size(); i++)
    {
        for (uint j = i+1; j < node_vec.size(); j++)
        {
            // some computation, basic math, store the result in variable x
            if (x > threshold)
            {
                vec1.push_back(make_pair(i, j));
                vec2.push_back(make_pair(j, i));
            }
        }
    }

If you can use the result in that form, you're done. Otherwise you could do some post-processing. Just don't copy it into a std::set again (obviously). Try to stick to std::vector<POD>. E.g. you could build an index into the vectors like this:

    // ...
    vector<int> index1 = build_index(node_vec.size(), vec1);
    vector<int> index2 = build_index(node_vec.size(), vec2);
    // ...
}    

vector<int> build_index(size_t count, vector<pair<int, int> > const& vec)
{
    vector<int> index(count, -1);

    size_t i = vec.size();
    do
    {
        i--;
        assert(vec[i].first >= 0);
        assert(vec[i].first < count);
        index[vec[i].first] = i;
    }
    while (i != 0);

    return index;
}

ps.: I'm almost sure your loop is not memory-bound. Can't be sure though... if the "nodes" you're not showing us are really big it might still be.

Original answer:

There is no easy I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len)-kind-of solution.

You can do some things though.

Make sure the compiler can "see" the implementation of every function that's called inside critical loops. What is necessary for the compiler to be able to "see" the implementation depends on the compiler. There is one way to be sure though: define all relevant functions in the same translation unit before the loop, and declare them as inline.
This also means you should not by any means call "external" functions in those critical loops. And by "external" functions I mean things like system-calls, runtime-library stuff or stuff implemented in a DLL/SO. Also don't call virtual functions and don't use function pointers. And or course don't allocate or free memory (inside the critical loops).
Make sure you use an optimal algorithm. Linear optimization is moot if the complexity of the algorithm is higher than necessary.
Use the smallest possible types. E.g. don't use int if signed char will do the job. That's something I wouldn't normally recommend, but when processing a large chunk of memory it can increase performance quite a lot. Especially in very tight loops.
If you're just copying or filling memory, use memcpy or memset. Disable the intrinsic version of those two functions if the chunks are larger then about 50 to 100 bytes.
Make sure you access the data in a cache-friendly manner. The optimum is "streaming" - i.e. accessing the memory with ascending or descending addresses. You can "jump" ahead some bytes at a time, but don't jump too far. The worst is random access to a big block of memory. E.g. if you have to work on a 2 dimensional matrix (like a bitmap image) where p[0] to p[1] is a step "to the right" (x + 1), make sure the inner loop increments x and the outer increments y. If you do it the other way around performance will be much much worse.
If your pointers are alias-free, you can tell the compiler (how that's done depends on the compiler). If you don't know what alias-free means I recommend searching the net and your compiler's documentation, since an explanation would be beyond the scope.
Use intrinsic SIMD instructions if appropriate.
Use explicit prefetch instructions if you know which memory locations will be needed in the near future.