I have built plain C code on Linux (Fedora) using the CodeSourcery tool-chain. This is for an ARM Cortex-A8 target. The code runs on a Cortex-A8 board running embedded Linux.
When I run this code for a test case that does a dynamic memory allocation (malloc) of some large size (10 MB), it crashes after some time, giving the error message below:
select 1 (init), adj 0, size 61, to kill
select 1030 (syslogd), adj 0, size 64, to kill
select 1032 (klogd), adj 0, size 74, to kill
select 1227 (bash), adj 0, size 378, to kill
select 1254 (ppp), adj 0, size 1069, to kill
select 1255 (TheoraDec_Corte), adj 0, size 1159, to kill
send sigkill to 1255 (TheoraDec_Corte), adj 0, size 1159
Program terminated with signal SIGKILL, Killed.
Then, when I debug this code for the same test case using a gdb built for the target, at the point where this dynamic memory allocation happens, the code fails to allocate that memory and malloc returns NULL. But during a normal stand-alone run, I believe malloc should be failing to allocate as well, yet strangely it does not seem to return NULL; instead the program crashes and the OS kills my process.
- Why is this behaviour different when run under gdb and when run without the debugger?
- Why would malloc fail yet not return NULL? Is that possible, or is there some other reason for the error message I am getting?
- How do I fix this?
thanks,
-AD
2 Answers
So, for this part of the question, there is a surefire answer:
In Linux, by default the kernel interfaces for allocating memory almost never fail outright. Instead, they set up your page tables so that on the first access to the memory you asked for, the CPU generates a page fault, at which point the kernel handles it and finds physical memory to use for that (virtual) page. So, in an out-of-memory situation, you can ask the kernel for memory and it will "succeed"; the first time you try to touch the memory it returned is when the allocation actually fails, killing your process. (Or perhaps some other unfortunate victim; there are heuristics for picking one, which I'm not incredibly familiar with. See "oom-killer".)
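As a rough sketch of that failure mode (assuming the default overcommit policy, vm.overcommit_memory = 0, and a memory-starved system; the 10 MB size just mirrors the question), malloc can return a non-NULL pointer and the process still gets SIGKILLed only when the pages are first touched:

/* Sketch of the failure mode described above, under the default
 * overcommit policy (vm.overcommit_memory = 0). Run on a
 * memory-starved box at your own risk: the OOM killer may pick
 * another process entirely. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t sz = (size_t)10 * 1024 * 1024;  /* 10 MB, as in the question */
    char *p = malloc(sz);
    if (p == NULL) {                       /* what you observe under gdb */
        fprintf(stderr, "malloc returned NULL\n");
        return EXIT_FAILURE;
    }
    /* malloc "succeeded", but only virtual address space was set up.
     * Touching every page forces the kernel to back it with physical
     * memory; if none is available, SIGKILL arrives here, not above. */
    memset(p, 0xAA, sz);
    puts("every page is now backed by physical memory");
    free(p);
    return EXIT_SUCCESS;
}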
For some of your other questions, the answers are less clear to me.
It could be (just a guess, really) that GDB has its own malloc and is tracking your allocations somehow. On a somewhat related point, I've actually frequently found that heap bugs in my code often aren't reproducible under debuggers. This is frustrating and makes me scratch my head, but it's basically something I've pretty much figured one has to live with...

This is a bit of a sledgehammer solution (that is, it changes the behavior for all processes rather than just your own, and it's generally not a good idea to have your program alter global state like that), but you can write the string 2 to /proc/sys/vm/overcommit_memory to make the kernel account for memory strictly, so that allocations fail up front. See this link that I got from a Google search.

Failing that... I'd just make sure you're not allocating more than you expect to.
By definition, running under a debugger is different from running standalone. Debuggers can and do hide many bugs. If you compile for debugging you can add a fair amount of code, similar to compiling completely unoptimized (allowing you to single-step or watch variables, for example), whereas compiling for release can remove debugging options and remove code that you needed; there are many optimization traps you can fall into. I don't know from your post who is controlling the compile options or what they are.
Unless you plan to deliver the product to be run under the debugger, you should do your testing standalone. Ideally, do your development without the debugger as well; that saves you from having to do everything twice.
It sounds like a bug in your code. Slowly re-read your code with fresh eyes, as if you were explaining it to someone, or perhaps actually explain it to someone, line by line. There may be something right there that you cannot see because you have been looking at it the same way for too long. It is amazing how often and how well that works.
It could also be a compiler bug. Doing things like printing out the return value, or not, can cause the compiler to generate different code. Adding another variable and saving the result to that variable can kick the compiler into doing something different. Try changing the compiler options: reduce or remove any optimization options, reduce or remove the debugger compiler options, etc.
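As a hypothetical illustration of the "add another variable" trick (the names here are made up, not taken from the original code): storing the pointer in a volatile temporary forces the compiler to keep the store and the NULL check, which can change the generated code enough to expose, or hide, the bug:

/* Hypothetical sketch: a volatile temporary keeps the compiler from
 * optimizing away the result and its NULL check. do_work() is a
 * stand-in for whatever the real code does with the buffer. */
#include <stdlib.h>
#include <string.h>

static void do_work(char *buf, size_t sz)
{
    memset(buf, 0, sz);   /* placeholder for the real processing */
}

static void run(size_t sz)
{
    char * volatile result = malloc(sz); /* volatile: real store, real load */
    if (result == NULL)                  /* this check cannot be elided now */
        abort();
    do_work(result, sz);
    free(result);
}

int main(void)
{
    run((size_t)1024 * 1024);  /* 1 MB example allocation */
    return 0;
}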
Is this a proven system, or are you developing on new hardware? Try running without any of the caches enabled, for example. If it is not a compiler bug, working in a debugger but not standalone can be a timing issue: single-stepping flushes the pipeline, mixes the cache up differently, and gives the cache and memory system an eternity to come up with a result that it does not have in real time.
In short, there is a very long list of reasons why running under a debugger hides bugs that you cannot find until you test in an environment like the final deliverable; I have only touched on a few. Having it work in the debugger and not standalone is not unexpected; it is simply how the tools work. Based on the description you have given so far, it is likely your code, the hardware, or your tools.
The fastest way to rule out your code or the tools is to disassemble that section and inspect how the passed values and return values are handled. If the return value is optimized out, there is your answer.
Are you compiling against a shared C library or a static one? Perhaps try compiling static...