Is there a way to locate which part of a process used the most memory, looking only at a generated core file?

Posted on 2024-12-24 17:03:33

I have a process (restarted by a watchdog whenever it stops for any reason) that usually uses about 200MB of memory. Once I saw it eating up memory, with usage at about 1.5-2GB, which definitely means a "memory leak" somewhere. I put "memory leak" in quotes because it is not a real leak in the sense of memory that was allocated, never freed, and is now unreachable; note that only smart pointers are used. So I suspect some huge container (which I have not found) or something similar.

Later the process crashed because of the high memory usage, and a core dump of about 2GB was generated. The problem is that I can't reproduce the issue, so valgrind won't help here (I guess). It happens very rarely and I can't "catch" it.

So my question is: using the exe and the core file, is there a way to locate which part of the process used most of the memory?

I took a look at the core file with gdb and found nothing unusual. But the core is big, so there must be something. Is there a clever way to understand what happened, or is guessing the only option? (That is hard for such a big exe: 12 threads, about 50-100 classes, maybe more.)

It's a C++ application running on RHEL5U3.

Comments (4)

带上头具痛哭 2024-12-31 17:03:33

Open the core dump in a hex viewer (as bytes, words, dwords, or qwords). Starting from the middle of the file, look for any repeating pattern. If you find one, try to determine the starting address and length of the data structure it might represent. From the length and contents of this structure, try to guess what it is. Using its address, try to find a pointer to the structure. Repeat until you reach either the stack or a global variable. For a stack variable, you will easily see in which function the chain starts. For a global variable, you at least know its type.

If you cannot find any pattern in the core dump, chances are that the leaking structure is very big. In that case, compare what you see in the file with the possible contents of all large structures in the program.

Update

If your core dump has a valid call stack, you can start by inspecting its functions. Look for anything unusual. Check that memory allocations near the top of the call stack do not request too much. Check for possible infinite loops in the functions on the call stack.

The words "only smart pointers are used" frighten me. If a significant part of these smart pointers are shared pointers (shared_ptr, intrusive_ptr, ...), it is worth searching for shared pointer cycles rather than for huge containers.
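To illustrate why shared pointer cycles matter here, the following is a minimal sketch. It assumes C++11 std::shared_ptr (an RHEL5-era code base would more likely use boost::shared_ptr, which behaves the same way in this respect). The Node type and both function names are hypothetical. Two objects that own each other are never destroyed, even though no raw pointer is used anywhere:

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type, used only for this illustration.
struct Node {
    static int live;                // number of Node objects currently alive
    std::shared_ptr<Node> strong;   // owning link: can form a cycle
    std::weak_ptr<Node> weak;       // non-owning link: cannot keep the peer alive
    Node() { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// Builds two nodes that own each other and returns how many are still
// alive after the last external shared_ptr has gone out of scope.
int leaked_after_strong_cycle() {
    const int before = Node::live;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->strong = b;
        b->strong = a;                 // cycle: the use counts can never reach zero
        assert(a.use_count() == 2);    // one reference from `a`, one from b->strong
    }
    return Node::live - before;
}

// Same shape, but the back link is a weak_ptr, so both nodes are destroyed.
int leaked_after_weak_back_link() {
    const int before = Node::live;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->strong = b;
        b->weak = a;                   // does not contribute to a's use count
    }
    return Node::live - before;
}
```

With the strong cycle, both nodes stay allocated after the scope ends; with the weak back link, none do. If such cycles are created repeatedly, memory use grows steadily while every pointer in the program is still a smart pointer.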

Update 2

Try to determine where your heap ends in the core file (the brk value). Run the coredumped process under gdb and use the pmap command (from another terminal). gdb should also know this value, but I have no idea how to ask it... If most of the process's memory is above brk, you can limit your search to large memory allocations (most likely std::vector).

To improve the chances of finding leaks in the heap area of the existing core dump, some coding may help (I have not done this myself; it is just a theory):

  • Read the core dump file, interpreting each value as a pointer (ignore the code segment, unaligned values, and pointers to non-heap areas). Sort the list and calculate the differences between adjacent elements.
  • At this point the whole memory is split into many possible structures. Compute a histogram of structure sizes and drop any insignificant values.
  • Calculate the difference between the address of each pointer and the address of the structure that contains it. For each structure size, compute a histogram of pointer displacements, and again drop any insignificant values.
  • Now you have enough information to guess structure types or to construct a directed graph of structures. Find the source nodes and cycles of this graph. You can even visualize the graph, as in "list "cold" memory areas".
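The first two steps of the list above could be sketched roughly as follows. This is only an illustration under several assumptions: the relevant part of the core has already been read into a byte buffer, the heap address range is known, and the dump comes from a 64-bit process analysed on a host with the same byte order (for a 32-bit core, Word would be std::uint32_t). The names heap_pointer_targets and gap_histogram are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

using Word = std::uint64_t;   // assumption: 64-bit dump; use std::uint32_t for 32-bit

// Treats every word-aligned value in `bytes` as a potential pointer and keeps
// those that are aligned and fall inside [heap_start, heap_end).
// Returns the sorted, de-duplicated list of addresses they point to.
std::vector<Word> heap_pointer_targets(const std::vector<unsigned char>& bytes,
                                       Word heap_start, Word heap_end) {
    assert(heap_start <= heap_end);
    std::vector<Word> targets;
    for (std::size_t off = 0; off + sizeof(Word) <= bytes.size(); off += sizeof(Word)) {
        Word value;
        std::memcpy(&value, &bytes[off], sizeof value);   // safe unaligned read
        const bool aligned = value % sizeof(Word) == 0;
        if (aligned && value >= heap_start && value < heap_end)
            targets.push_back(value);
    }
    std::sort(targets.begin(), targets.end());
    targets.erase(std::unique(targets.begin(), targets.end()), targets.end());
    return targets;
}

// Histogram of the gaps between adjacent targets: gap size -> occurrences.
// Frequent gap sizes are candidates for the size of a repeated structure.
std::map<Word, std::size_t> gap_histogram(const std::vector<Word>& sorted_targets) {
    std::map<Word, std::size_t> histogram;
    for (std::size_t i = 1; i < sorted_targets.size(); ++i)
        ++histogram[sorted_targets[i] - sorted_targets[i - 1]];
    return histogram;
}
```

A gap size that dominates the histogram suggests many objects of one type laid out back to back. The gaps also include allocator bookkeeping and padding, so they only approximate the real object size.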

The core dump file is in ELF format. Only the start and size of the data segment are needed from its header. To simplify the process, just read it as a linear file, ignoring the structure.
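As a sketch of pulling segment starts and sizes out of the header, the code below lists the PT_LOAD program headers of an ELF64 file using the Linux <elf.h> definitions. It assumes a 64-bit core with the same byte order as the analysing host and fewer than 65535 segments (the PN_XNUM extension is not handled). Segment, load_segments, and load_segments_from_bytes are names invented for this example.

```cpp
#include <cassert>
#include <cstring>
#include <elf.h>      // Linux-specific: Elf64_Ehdr, Elf64_Phdr, PT_LOAD, PF_W
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Segment {
    Elf64_Addr vaddr;     // address the segment had in the dumped process
    Elf64_Off offset;     // where its bytes start inside the core file
    Elf64_Xword filesz;   // bytes actually stored in the file
    Elf64_Xword memsz;    // bytes the segment occupied in memory
    bool writable;        // heap and other data segments are writable
};

// Reads the ELF header and program headers from `in` and returns the
// PT_LOAD segments. Returns an empty list if the input is not ELF64.
std::vector<Segment> load_segments(std::istream& in) {
    std::vector<Segment> segments;
    Elf64_Ehdr eh;
    if (!in.read(reinterpret_cast<char*>(&eh), sizeof eh)) return segments;
    if (std::memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) return segments;
    if (eh.e_ident[EI_CLASS] != ELFCLASS64) return segments;
    if (eh.e_phentsize != sizeof(Elf64_Phdr)) return segments;

    for (unsigned i = 0; i < eh.e_phnum; ++i) {
        const Elf64_Off pos = eh.e_phoff + static_cast<Elf64_Off>(i) * eh.e_phentsize;
        in.seekg(static_cast<std::streamoff>(pos));
        Elf64_Phdr ph;
        if (!in.read(reinterpret_cast<char*>(&ph), sizeof ph)) break;
        if (ph.p_type != PT_LOAD) continue;   // skip PT_NOTE and others
        Segment s = {ph.p_vaddr, ph.p_offset, ph.p_filesz, ph.p_memsz,
                     (ph.p_flags & PF_W) != 0};
        segments.push_back(s);
    }
    return segments;
}

// Convenience for experiments: parse an image that is already in memory.
std::vector<Segment> load_segments_from_bytes(const std::string& image) {
    std::istringstream in(image);
    return load_segments(in);
}
```

For a real core, load_segments can be given a std::ifstream opened in binary mode, so only the headers are read rather than the whole 2GB file. The core does not label which segment is the heap; large writable segments are the natural candidates, and their vaddr and offset fields give the mapping between process addresses and file positions.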

瑾兮 2024-12-31 17:03:33

Once I saw it's eating up the memory - with memory usage about 1.5-2GB

Quite often this is the end result of a runaway loop. Something like:

size_t size = 1;
char *p = malloc(size);
while (!enough_space(size)) {
  size *= 2;             // the request doubles on every pass
  p = realloc(p, size);  // keeps growing for as long as enough_space() is false
}
// now use p to do whatever

If enough_space() erroneously returns false under some conditions, your process will quickly grow to consume all available memory.
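As a rough back-of-the-envelope check of how fast such a loop grows, here is a small sketch (doublings_until is a hypothetical helper, not part of any library). Starting from 1 byte, the size reaches 2^31 bytes (2 GiB) after 31 doublings:

```cpp
#include <cassert>
#include <cstdint>

// Counts how many doublings it takes for a buffer that starts at
// `start_bytes` to reach at least `limit_bytes`.
unsigned doublings_until(std::uint64_t start_bytes, std::uint64_t limit_bytes) {
    assert(start_bytes > 0);
    assert(limit_bytes <= (std::uint64_t(1) << 63));   // keeps size *= 2 from overflowing
    unsigned steps = 0;
    for (std::uint64_t size = start_bytes; size < limit_bytes; size *= 2)
        ++steps;
    return steps;
}
```

A few dozen iterations are enough to go from a trivially small buffer to the sizes described in the question, so the growth would look almost instantaneous from the outside.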

only smart pointers are used

Unless you control all the code linked into the process, the above statement is false. Such a loop could be inside libc or any other library that you don't own.

only guessing may help

That's pretty much it. Evgeny's answer has good starting points to help you guess.

寄居人 2024-12-31 17:03:33

Normal memory allocators don't keep track of which part of the process allocated memory; after all, the memory will be freed anyway, and the pointers are held by the client code. If the memory has truly leaked (i.e. no pointers to it are left), you have pretty much lost and are looking at a huge block of unstructured memory.

晚雾 2024-12-31 17:03:33

Valgrind will likely find several possible errors, and it is worthwhile to analyse all of them. You need to create a suppression file and use it like this: --suppressions=/path/to/file.supp. For each possible error that valgrind flags, either add a clause to the suppression file or change your program.

Your program will run slower under Valgrind, so the timing of events will be different, and you can't be sure of seeing your error occur.

There is a GUI for valgrind called Alleyoop, but I have not used it much.
