Static function-scoped pointers and memory leaks
I've written a simple library file with a function for reading lines from a file of any size. The function is called by passing in a stack-allocated buffer and size, but if the line is too big, a special heap-allocated buffer is initialized and used to pass back a larger line.
This heap-allocated buffer is function-scoped and declared static, initialized to NULL at the beginning of course. I've written in some checks at the beginning of the function, to check if the heap buffer is non-null; if this is the case, then the previous line read was too long. Naturally, I free the heap buffer and set it back to NULL, thinking that the next read will likely only need to fill the stack-allocated buffer (it should be very rare to see lines over 1MB long, even in our application!).
I've gone over the code fairly thoroughly, both by reading it carefully and by running a few tests. I am reasonably confident that the following invariants are maintained:
- The heap buffer will be null (and will not leak any memory) on function return if the stack buffer is all that is needed.
- If the heap buffer is not null, because it was needed, it will be freed on the next function call (and possibly reused if needed on that next line).
But I've thought of a potential problem: If the last line in a file is too long, then since the function is presumably not called again, I'm not sure I have any way to free the heap buffer-- it is function-scoped, after all.
So my question is, how do I go about freeing dynamically allocated memory in a function-scoped static pointer, ideally without calling the function again? (And ideally without making it a global variable, either!)
Code available on request. (I just haven't got access now, sorry. And I'm hoping the question is sufficiently general and well-explained for it not to be needed, but by all means feel free to disabuse me of that notion!)
EDIT: I feel I should add a couple of notes about the usage of the function.
This particular function is used in the form of lines being read serially from a file, and then immediately copied into POD structs, one line per struct. Those are created on the heap as the file is read, and each one of those structs has a char pointer containing (a cleaned up version of) a line from the file. In order for these to persist, a copy already has to occur. (That was one of the big counterarguments brought up in many of the answers-- oh no, the line needs to be COPIED, oh dearie me).
As for multithreading, as I said this is designed to be used serially. No, it isn't thread safe, but I don't care.
Thanks for the multitude of responses, though! I'll read them more thoroughly when I get time. Currently, I'm leaning towards either passing an extra pointer around or redesigning the function so that when fgets shows EOF, I can build the freeing logic there instead, and the user hopefully won't need to worry about it.
Comments (8)
If you can change the function, I would recommend changing the function interface itself. I know you have spent a lot of time debugging and testing it, but there are a few problems with your current implementation:
- [...] malloc()ed, thus nullifying any advantage you got by the selective use of malloc() in your function.
- Your users should not have to worry about the implementation oddities of your function; they should be able to "just use it".
Unless you are doing it for educational purposes, I would recommend looking at this page, which has one implementation of "reading an arbitrarily long line from a stream", and links to other such implementations (each implementation is slightly different from the others, so you should be able to find one that you like).
Based upon your edit, MT-safety is not a requirement, and a copy is always going to happen. So, the most obvious design is one of the two:
- Provide a char **, which points to a buffer that your function will allocate, using a combination of malloc() and realloc() (if needed). It is the user's responsibility to free() it when done. That way, the user doesn't have to copy the data again, since he can pass a pointer to wherever the final destination of the data is.
- A char * that is allocated by your function. Again, it's the user's responsibility to free() it.
Both are pretty much equivalent.
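A minimal sketch of the first design, assuming a hypothetical function named read_line whose signature is loosely modeled on POSIX getline(). The name, the starting size of 128 bytes, and the return convention are all illustrative assumptions, not anything from the question's code. The caller owns the buffer from start to finish, so no static pointer is needed at all.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the question's code).
 * Reads one line into *buf, growing it with realloc() as needed, and strips
 * the trailing newline. *buf may be NULL and *cap 0 on the first call.
 * Returns the line length, or -1 at end of file or on allocation failure.
 * The caller owns *buf and must free() it when done.
 * Assumes lines short enough that sizes fit in an int, and no embedded NULs. */
long read_line(FILE *fp, char **buf, size_t *cap)
{
    size_t len = 0;

    assert(fp != NULL && buf != NULL && cap != NULL);

    if (*buf == NULL || *cap < 2) {
        char *p = realloc(*buf, 128);       /* arbitrary starting size */
        if (p == NULL)
            return -1;
        *buf = p;
        *cap = 128;
    }

    while (fgets(*buf + len, (int)(*cap - len), fp) != NULL) {
        len += strlen(*buf + len);
        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[--len] = '\0';           /* complete line: strip newline */
            return (long)len;
        }
        if (*cap - len < 2) {               /* buffer full: double it */
            char *p = realloc(*buf, *cap * 2);
            if (p == NULL)
                return -1;
            *buf = p;
            *cap *= 2;
        }
    }

    /* EOF: report a final line that has no trailing newline, if any */
    return len > 0 ? (long)len : -1;
}
```

A caller would typically start with a NULL buffer and a capacity of 0, call read_line in a loop until it returns -1, and free() the buffer once after the loop. The same buffer is reused for every line, so long lines cost one reallocation rather than one allocation per call.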
For your current implementation, you can always return "not end of file" if the last line is very long, and doesn't end in a newline. Then, the user is going to call your function again, and then you can free your buffer. Personally, I would be happier with a function that allows me to read as many lines as I want, and not force me to go to the end of file.
Aside from the difficulty of freeing that dynamically allocated buffer, there is another potential problem. It is not thread safe. Since it is a library function, then there is always the possibility that it will be used in a multi-threaded environment in the future.
It would probably be better to require the calling function to free the buffer via a related library function.
That could still be okay if you use the standard technique to indicate end-of-file (i.e. have your read-line function return NULL).
What happens in this case is that after the final line is read, one more call to your read-line function will be needed so that it can return NULL to indicate that the end of file has been reached. In this last call, you can then free your buffer.
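The question's code is not shown, so the following is only a hypothetical reconstruction of the design it describes, written to illustrate the point above: because the function frees its static buffer at the start of every call, the extra call that returns NULL at end of file also releases a long final line. The name get_line and its signature are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reconstruction; names and signature are assumptions.
 * Returns buf when the line fits, a function-owned heap buffer when it does
 * not, and NULL at end of file or on allocation failure. The result is only
 * valid until the next call. */
char *get_line(FILE *fp, char *buf, size_t size)
{
    static char *heap = NULL;   /* function-scoped overflow buffer */
    size_t len, cap;

    assert(fp != NULL && buf != NULL && size >= 2);

    free(heap);                 /* release the previous long line, if any */
    heap = NULL;

    if (fgets(buf, (int)size, fp) == NULL)
        return NULL;            /* EOF: the overflow buffer is already freed */

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';
        return buf;             /* the whole line fit in the caller's buffer */
    }
    if (len < size - 1)
        return buf;             /* final line without a trailing newline */

    /* buf is full, so the line may continue: move it to the heap */
    cap = size * 2;
    heap = malloc(cap);
    if (heap == NULL)
        return NULL;
    memcpy(heap, buf, len + 1);

    while (fgets(heap + len, (int)(cap - len), fp) != NULL) {
        len += strlen(heap + len);
        if (heap[len - 1] == '\n') {
            heap[len - 1] = '\0';
            break;
        }
        if (cap - len < 2) {    /* heap buffer full: double it */
            char *p = realloc(heap, cap * 2);
            if (p == NULL) {
                free(heap);
                heap = NULL;
                return NULL;
            }
            heap = p;
            cap *= 2;
        }
    }
    return heap;
}
```

The leak only disappears if the caller loops until NULL is returned. A caller that stops early after a long line still leaves the static buffer allocated, which is the limitation other answers address.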
Two choices that occur immediately:
- Make the pointer to the heap-allocated buffer static but file scoped. Add a (static) function that checks whether it is non-null and, if so, free()s it. Call atexit(free_func) at the start of the program, where free_func is the static function. You can have some global setup routine (called by main()) where this is done.
- Don't worry about it; heap-allocated memory is released by the OS when your process exits, and the memory leak is not cumulative, so even if your program has a long life it won't run out of memory because of it (unless you have some other bug).
I assume your app is NOT multithreaded; if it is, you should not use a static buffer at all, or you should use thread-local data.
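The first choice could look roughly like the sketch below. The names overflow, free_overflow and overflow_reserve are hypothetical. As a small variation on the advice above, this sketch registers the cleanup lazily on first use rather than in a setup routine called from main(), which keeps the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* File scope, but static: invisible to other translation units. */
static char *overflow = NULL;
static int cleanup_registered = 0;

/* Runs at normal program exit once registered with atexit(). */
static void free_overflow(void)
{
    free(overflow);             /* free(NULL) is a no-op */
    overflow = NULL;
}

/* Hypothetical helper: make the overflow buffer at least `size` bytes,
 * preserving its contents. Returns the buffer, or NULL on failure. */
char *overflow_reserve(size_t size)
{
    char *p;

    assert(size > 0);

    if (!cleanup_registered) {
        if (atexit(free_overflow) != 0)
            return NULL;        /* could not register the cleanup handler */
        cleanup_registered = 1;
    }

    p = realloc(overflow, size);
    if (p != NULL)
        overflow = p;
    return p;
}
```

The read function would call overflow_reserve whenever a line outgrows the caller's buffer. Note that atexit() handlers run on a normal return from main() or a call to exit(), not on abort() or _Exit().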
The interface you have chosen makes this an unsolvable problem:
- The client must not know if the return value points to static or dynamic memory.
- The return value must point to memory that outlives the call.
- Any call might be the last.
I'm not sure why you are troubled by this leak. After all, if the client reads a very long line, does something with the line, then does a ton of computation and allocation before reading the next line, you still have a big hunk of memory sitting around unused, clogging up the system. If this is OK with you (arbitrary computation takes place before memory is reclaimed), you could just fess up that you're willing to retain dead memory indefinitely.
If you can't live with the leak, the simplest thing to do is to widen the interface so that the client can notify your function when the client is done with the memory. (Right now the contract with the client says that the client owns the memory until it calls your function again, at which point ownership reverts to your function.) Of course, to change the interface means either
- adding a new function, which would require you to promote your pointer to be static but local to the compilation unit, or
- adding some argument to the existing function (or overloading an argument) so that you have a call which means "I am done with your memory now, but I don't want another line".
A more radical change would be to rewrite the function to use dynamically allocated memory throughout its lifetime, gradually enlarging the block as needed until it is as large as the largest block ever read (or perhaps rounded up to the next power of two). Depending on actual cases this strategy may consume less address space than keeping a big static buffer.
In any case I'm not convinced you should be worrying about this corner case. If you think this case matters, please edit your question to show us the evidence.
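As a rough sketch of the "overloaded argument" idea combined with the more radical change, the hypothetical function next_line below owns a single dynamically allocated buffer that only ever grows, and treats a NULL stream as the call meaning "I am done with your memory now, but I don't want another line". The name, the NULL convention, and the 64-byte starting size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interface:
 *   next_line(fp)   returns the next line (newline stripped) in a buffer
 *                   owned by this function, or NULL at end of file;
 *   next_line(NULL) releases the buffer and returns NULL.
 * The returned pointer is only valid until the next call. */
char *next_line(FILE *fp)
{
    static char *buf = NULL;    /* grows to the longest line seen so far */
    static size_t cap = 0;
    size_t len = 0;

    if (fp == NULL) {           /* the "I am done" call */
        free(buf);
        buf = NULL;
        cap = 0;
        return NULL;
    }

    if (buf == NULL) {
        buf = malloc(64);       /* arbitrary starting size */
        if (buf == NULL)
            return NULL;
        cap = 64;
    }
    assert(cap >= 2);

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (buf[len - 1] == '\n') {
            buf[len - 1] = '\0';
            return buf;
        }
        if (cap - len < 2) {    /* full: double the block */
            char *p = realloc(buf, cap * 2);
            if (p == NULL)
                return NULL;    /* buf is still owned and still valid */
            buf = p;
            cap *= 2;
        }
    }
    return len > 0 ? buf : NULL;    /* final line without a newline, or EOF */
}
```

With this shape the client can stop reading at any point and still hand the memory back, at the cost of one extra call that it has to remember to make.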
Instead of function scope, give it module scope (i.e. at file scope, but static, so it's not visible outside that file). Add a small function that frees the buffer, and use atexit() to ensure that it's called before the program exits. Alternatively, don't worry about it: a leak that happens only once, and is freed automatically as the program exits, isn't particularly harmful.
I feel obliged to say that the design sounds to me like a recipe for disaster though. When you free the buffer, there's virtually no way to even guess whether it might still be in use. The user (apparently) has to keep track of where the data was returned, and copy the data to a new buffer if (and only if) you allocated one dynamically. In a multi-threading environment, you need to make the internal pointer thread-local to have any chance of working correctly at all. To the user, the function might do one of two entirely different things: either return a buffer that's owned by the user, OR return a buffer that's owned by the function, and can only be used safely by allocating another buffer, and copying the data into the other buffer before the function is called again.
There are a few hacks I can think of, although both require moving the static declaration out of the function. I can't imagine why that would be a problem.
Using a GCC extension,
Using C++,
In any case, I don't really like the design. In C, usually the caller is responsible for allocating/freeing memory that it needs to use, even if it's filled in by a callee.
BTW, compare with Glibc's nonstandard getline. It never uses static memory.
I was just going to comment below Mark's answer, but it may feel a little bit cramped. Still, this answer is in essence a comment on his answer, which I find very good in addition to being quick :).
Not only is your function not MT-safe, but even without threads, the interface to use it correctly is complicated. The caller must have finished with the previous result before calling the function again. If this code is still in use two years from now, someone will scratch his head trying to use it right... or worse, use it wrong without even thinking about it. That person could even be you...
Mark's suggestion (requiring the caller to free the buffer) is IMHO the most reasonable. But perhaps you don't trust malloc and free not to cause fragmentation in the long run, or have some other reason to prefer the static buffer solution.
In this case you can keep the static buffer for ordinary-length lines, define a boolean flag that indicates if the static buffer is currently busy, and document that the following function (and not free) should be called with the address of the buffer when the caller no longer uses it:
The only circumstances in which the assertions will trigger are circumstances in which your previous implementation would have silently gone wrong, and the overhead is very low compared to your existing solution (only toggling the flag, and asking the caller to call free_buffer when he's finished, which is cleaner). If the assertion in get_line in particular triggers, it means you needed dynamic allocation after all, because the caller could not be finished with a buffer at the time he was asking for another.
Note: this is still not MT-safe.
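The code that belonged to this answer does not appear above, so here is a hedged sketch of what the description suggests rather than the answerer's actual code. The names get_line and free_buffer come from the text; the signatures, the 256-byte static buffer, and the choice to hand long lines back as heap copies are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_BUF_SIZE 256               /* assumed "ordinary" line length */

static char line_buf[LINE_BUF_SIZE];    /* shared buffer for short lines */
static bool line_buf_busy = false;      /* true while a caller holds it */

/* Returns the next line (newline stripped), or NULL at end of file or on
 * allocation failure. The caller must pass the result to free_buffer(). */
char *get_line(FILE *fp)
{
    size_t len, cap;
    char *heap;

    assert(!line_buf_busy);     /* the previous line was never released */

    if (fgets(line_buf, (int)sizeof line_buf, fp) == NULL)
        return NULL;

    len = strlen(line_buf);
    if (len < sizeof line_buf - 1 || line_buf[len - 1] == '\n') {
        line_buf[strcspn(line_buf, "\n")] = '\0';
        line_buf_busy = true;
        return line_buf;        /* ordinary line: static buffer */
    }

    /* Long line: build a heap copy; the static buffer stays free. */
    cap = 2 * sizeof line_buf;
    heap = malloc(cap);
    if (heap == NULL)
        return NULL;
    memcpy(heap, line_buf, len + 1);
    while (fgets(heap + len, (int)(cap - len), fp) != NULL) {
        len += strlen(heap + len);
        if (heap[len - 1] == '\n') {
            heap[len - 1] = '\0';
            break;
        }
        if (cap - len < 2) {
            char *p = realloc(heap, cap * 2);
            if (p == NULL) {
                free(heap);
                return NULL;
            }
            heap = p;
            cap *= 2;
        }
    }
    return heap;
}

/* Must be called instead of free() on every pointer from get_line(). */
void free_buffer(char *p)
{
    if (p == line_buf) {
        assert(line_buf_busy);  /* released twice, or never handed out */
        line_buf_busy = false;
    } else {
        free(p);
    }
}
```

In this sketch a long final line is never stranded in a static pointer, because the heap copy belongs to the caller until free_buffer is called on it.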