Efficiency of repeated memory allocation in glibc

Posted on 2024-07-22 15:23:41

Below is my C wrapper for the Fortran ZHEEVR routine from the well-known LAPACK numerical library:

void zheevr(char jobz, char range, char uplo, int n, doublecomplex* a, int lda, double vl, double vu, int il, int iu, double abstol, double* w, doublecomplex* z, int ldz, int* info)
{
    int m;
    int lwork = -1;
    int liwork = -1;
    int lrwork = -1;
    int* isuppz = alloc_memory(sizeof(int) * 2 * n);
    zheevr_(&jobz, &range, &uplo, &n, a, &lda, &vl, &vu, &il, &iu, &abstol, &m, w, z, &ldz, isuppz, small_work_doublecomplex, &lwork, small_work_double, &lrwork, small_work_int, &liwork, info);
    lwork = (int) small_work_doublecomplex[0].real;
    liwork = small_work_int[0];
    lrwork = (int) small_work_double[0];
    doublecomplex* work = alloc_memory(sizeof(doublecomplex) * lwork);
    double* rwork = alloc_memory(sizeof(double) * lrwork);
    int* iwork = alloc_memory(sizeof(int) * liwork);
    zheevr_(&jobz, &range, &uplo, &n, a, &lda, &vl, &vu, &il, &iu, &abstol, &m, w, z, &ldz, isuppz, work, &lwork, rwork, &lrwork, iwork, &liwork, info);
    free(iwork);
    free(rwork);
    free(work);
    free(isuppz);
}

This function is called hundreds of thousands of times in my application to diagonalize the complex matrix "a" (parameter names follow the Fortran convention for this function), always for the same matrix size. I think the work array sizes will be the same most of the time, since the diagonalized matrices all have the same structure. My questions are:

  1. Can the repeated alloc/free calls ("alloc_memory" is a simple wrapper around glibc's malloc) hurt performance, and if so, how badly?
  2. Does the order of the free calls matter? Should I free the last allocated array first, or last?

Comments (5)

一身骄傲 2024-07-29 15:23:41
  • Can you use C99? (Answer: yes, you are already using C99 notation: declaring variables where they are needed.)
  • Are the sizes of the arrays sane (not too huge)?

If both answers are 'yes', consider using VLAs (variable-length arrays):

void zheevr(char jobz, char range, char uplo, int n, doublecomplex* a, int lda, double vl, double vu, int il, int iu, double abstol, double* w, doublecomplex* z, int ldz, int* info)
{
    int m;
    int lwork = -1;
    int liwork = -1;
    int lrwork = -1;
    int isuppz[2*n];
    zheevr_(&jobz, &range, &uplo, &n, a, &lda, &vl, &vu, &il, &iu, &abstol, &m, w, z, &ldz, isuppz, small_work_doublecomplex, &lwork, small_work_double, &lrwork, small_work_int, &liwork, info);
    lwork = (int) small_work_doublecomplex[0].real;
    liwork = small_work_int[0];
    lrwork = (int) small_work_double[0];
    doublecomplex work[lwork];
    double rwork[lrwork];
    int iwork[liwork];
    zheevr_(&jobz, &range, &uplo, &n, a, &lda, &vl, &vu, &il, &iu, &abstol, &m, w, z, &ldz, isuppz, work, &lwork, rwork, &lrwork, iwork, &liwork, info);
}

One nice thing about VLAs is that you do not have to free anything.

(Untested code!)
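The "sane sizes" condition matters because VLAs live on the stack. A minimal sketch of one way to guard against oversized requests: use a VLA for small sizes and fall back to malloc otherwise. The function `sum_via_scratch` and the threshold `STACK_SCRATCH_MAX` are hypothetical names chosen for illustration, and the threshold value is an arbitrary assumption, not a recommendation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical limit: largest scratch size (in bytes) allowed on the stack. */
#define STACK_SCRATCH_MAX 4096

/* Copies n doubles into a scratch buffer and sums them.
 * The scratch buffer is a VLA when it is small, and heap memory otherwise.
 * Returns 0 on success, -1 if the size overflows or malloc fails. */
int sum_via_scratch(const double *x, size_t n, double *out)
{
    assert(out != NULL);
    *out = 0.0;
    if (n == 0)
        return 0;
    if (n > SIZE_MAX / sizeof(double))
        return -1;

    size_t bytes = n * sizeof(double);
    int on_stack = bytes <= STACK_SCRATCH_MAX;

    /* The VLA gets its real length only when the request is small. */
    double vla[on_stack ? n : 1];
    double *scratch = on_stack ? vla : malloc(bytes);
    if (scratch == NULL)
        return -1;

    memcpy(scratch, x, bytes);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += scratch[i];

    if (!on_stack)
        free(scratch);   /* only heap memory needs freeing */
    *out = total;
    return 0;
}
```

The same pattern could be applied to each work array in the wrapper, at the cost of bringing back a conditional free for the large case.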

还不是爱你 2024-07-29 15:23:41

1) Yes, they can.

2) No sane libc should care about the order of the free() calls, and performance-wise it shouldn't matter either.

I'd recommend moving the memory management out of this function, so that the caller supplies the matrix size and the preallocated temporary buffers. That will cut the number of mallocs significantly if the function is called from the same place on matrices of the same size.
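A minimal sketch of what a caller-owned workspace might look like. The `zheevr_workspace` struct and its two helper functions are hypothetical names, and the `doublecomplex` typedef is an assumed stand-in for the type used in the question. The sizes passed to the init function are assumed to come from a single workspace query (the call with lwork = -1) made by the caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed stand-in for the doublecomplex type used by the wrapper. */
typedef struct { double real, imag; } doublecomplex;

/* Hypothetical bundle of all scratch arrays ZHEEVR needs for one matrix size. */
typedef struct {
    int lwork, lrwork, liwork;
    int *isuppz;
    doublecomplex *work;
    double *rwork;
    int *iwork;
} zheevr_workspace;

/* Releases every buffer and resets the struct; safe to call more than once. */
void zheevr_workspace_free(zheevr_workspace *ws)
{
    assert(ws != NULL);
    free(ws->iwork);
    free(ws->rwork);
    free(ws->work);
    free(ws->isuppz);
    ws->iwork = NULL;
    ws->rwork = NULL;
    ws->work = NULL;
    ws->isuppz = NULL;
    ws->lwork = ws->lrwork = ws->liwork = 0;
}

/* Allocates all buffers once. Returns 0 on success, -1 on bad sizes or
 * allocation failure (bad sizes leave *ws untouched). */
int zheevr_workspace_init(zheevr_workspace *ws, int n,
                          int lwork, int lrwork, int liwork)
{
    assert(ws != NULL);
    if (n <= 0 || lwork <= 0 || lrwork <= 0 || liwork <= 0)
        return -1;

    ws->lwork = lwork;
    ws->lrwork = lrwork;
    ws->liwork = liwork;
    ws->isuppz = malloc(sizeof *ws->isuppz * 2 * (size_t)n);
    ws->work = malloc(sizeof *ws->work * (size_t)lwork);
    ws->rwork = malloc(sizeof *ws->rwork * (size_t)lrwork);
    ws->iwork = malloc(sizeof *ws->iwork * (size_t)liwork);

    if (!ws->isuppz || !ws->work || !ws->rwork || !ws->iwork) {
        zheevr_workspace_free(ws);
        return -1;
    }
    return 0;
}
```

With this shape, the wrapper would take a `zheevr_workspace *` as an extra parameter and pass `ws->work`, `ws->rwork`, `ws->iwork` and `ws->isuppz` straight to `zheevr_`, so the hot loop contains no malloc or free at all.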

总以为 2024-07-29 15:23:41

It certainly will affect performance; how much, you can only find out for sure by timing. To create a version that avoids most allocations, allocate to a static pointer and remember the size in another static integer. If the next call uses the same size, just reuse what was allocated last time. Only free anything when you need to create a new buffer because the size has changed.

Note this solution is only suitable for single-threaded code.
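The static-pointer idea could be sketched roughly as follows. The function name `cached_scratch` is hypothetical, and the convention that a size of 0 releases the buffer is an assumption added for illustration. As noted above, this is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns a scratch buffer of exactly `bytes` bytes, reusing the buffer from
 * the previous call when the requested size is unchanged.
 * Passing 0 releases the cached buffer and returns NULL.
 * Not thread-safe: the cache lives in static variables. */
void *cached_scratch(size_t bytes)
{
    static void *buf = NULL;
    static size_t buf_size = 0;

    /* Invariant: the recorded size is zero exactly when nothing is cached. */
    assert((buf == NULL) == (buf_size == 0));

    if (bytes == 0) {
        free(buf);
        buf = NULL;
        buf_size = 0;
        return NULL;
    }
    if (bytes != buf_size) {
        /* Size changed: drop the old buffer and allocate a new one. */
        free(buf);
        buf = malloc(bytes);
        buf_size = (buf != NULL) ? bytes : 0;
    }
    return buf;
}
```

The wrapper would need one such cache per work array (or one cache holding a combined block), and the contents from the previous call are not cleared on reuse.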

伴随着你 2024-07-29 15:23:41

Alright. You are going to get the "use a profiler" answer any time now. If you have an AMD machine, I strongly recommend AMD's free CodeAnalyst.

As for your memory problem, I think you could work with local memory management in this case. Just determine the maximum amount of memory that you may need to allocate for this function.
Next, declare a static buffer and work with it much as a compiler handles the stack. I once wrote a wrapper like this over VirtualAlloc, and it was VERY fast.
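A minimal sketch of a static buffer used like a stack (often called a bump or arena allocator). All names here (`arena_alloc`, `arena_mark`, `arena_release`, `ARENA_BYTES`) are hypothetical, and the 1 MiB capacity is an arbitrary assumption standing in for the maximum determined above. It is single-threaded only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on the scratch memory one call may need. */
#define ARENA_BYTES (1u << 20)

/* A union of the widest basic types gives the buffer suitable alignment. */
typedef union { long long ll; long double ld; double d; void *p; } arena_unit;

static arena_unit arena_storage[ARENA_BYTES / sizeof(arena_unit)];
static size_t arena_top = 0;   /* bytes in use, always a multiple of the unit */

/* Hands out the next `bytes` bytes of the static buffer, or NULL if full. */
void *arena_alloc(size_t bytes)
{
    const size_t unit = sizeof(arena_unit);
    const size_t capacity = sizeof arena_storage;

    /* capacity and arena_top are multiples of unit, so if the raw size
     * fits, the size rounded up to a whole unit fits as well. */
    if (bytes > capacity - arena_top)
        return NULL;
    size_t rounded = (bytes + unit - 1) / unit * unit;

    void *p = (unsigned char *)arena_storage + arena_top;
    arena_top += rounded;
    return p;
}

/* Remembers the current top so everything allocated later can be dropped. */
size_t arena_mark(void)
{
    return arena_top;
}

/* Pops the arena back to a mark taken earlier; this "frees" in one step. */
void arena_release(size_t mark)
{
    assert(mark <= arena_top);
    arena_top = mark;
}
```

In the wrapper, this would mean taking a mark on entry, carving the four work arrays out of the arena, and releasing back to the mark before returning, which costs a few additions per call.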

柏林苍穹下 2024-07-29 15:23:41

If you are allocating items of the same size hundreds of thousands of times, then why not maintain a heap of your own objects (these seem to be relatively simple, i.e. they don't contain pointers to other allocated memory) and free onto your own heap (or, actually, a stack)?

The heap can lazily allocate new objects using the glibc malloc, but when freeing, just push the item onto the heap. When you need to allocate and a freed object is available, hand out that one.

This will also save you multiple allocation calls (it looks like your routine makes several calls to malloc, and most of them would no longer be needed) and will avoid fragmentation, to some extent, at least for the reused memory. Of course, the initial allocations (and later ones made when the pool needs to grow) may still cause fragmentation. If you are really worried about that, you can gather some statistics on the average/max/typical size of your heap during runs and preallocate that amount at once when the program starts.
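A minimal sketch of such a pool for fixed-size blocks, kept as a LIFO free list. The names (`block_pool`, `pool_get`, `pool_put`, and so on) are hypothetical, and the sketch assumes every block in a given pool has the same size, as in the question. It is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>

/* A freed block is reused to store the link to the next freed block. */
typedef struct pool_node { struct pool_node *next; } pool_node;

typedef struct {
    size_t block_size;     /* size of every block handed out by this pool */
    pool_node *free_list;  /* stack of blocks returned with pool_put */
} block_pool;

void pool_init(block_pool *p, size_t block_size)
{
    assert(p != NULL);
    /* A block must be large enough to hold the free-list link. */
    p->block_size = block_size < sizeof(pool_node) ? sizeof(pool_node)
                                                   : block_size;
    p->free_list = NULL;
}

/* Pops a recycled block if one exists, otherwise falls back to malloc. */
void *pool_get(block_pool *p)
{
    if (p->free_list != NULL) {
        pool_node *n = p->free_list;
        p->free_list = n->next;
        return n;
    }
    return malloc(p->block_size);
}

/* Pushes a block onto the free list instead of calling free. */
void pool_put(block_pool *p, void *block)
{
    if (block == NULL)
        return;
    pool_node *n = block;
    n->next = p->free_list;
    p->free_list = n;
}

/* Returns all pooled blocks to malloc's heap, e.g. at program exit. */
void pool_destroy(block_pool *p)
{
    while (p->free_list != NULL) {
        pool_node *n = p->free_list;
        p->free_list = n->next;
        free(n);
    }
}
```

For the wrapper in the question, one pool per work array size would be enough; after the first call, every later call would be served from the free list without touching malloc.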
