How to determine the actual usage of a malloc'd buffer
I have some compressed binary data and an API call to decompress it, which requires a pre-allocated target buffer. The API offers no way to learn the size of the decompressed data. So I can malloc an oversized buffer to decompress into, but I would then like to resize it (or copy it) to a memory buffer of the correct size. How do I (indeed, can I) determine the actual size of the decompressed binary data in the oversized buffer?
(I do not control the compression of the data, so I do not know in advance what size to expect, and I cannot write a header for the file.)
Comments (5)
As others have said, there is no good way to do this if your API doesn't provide it.
I almost don't want to suggest this for fear that you'll take this suggestion and have some mission-critical piece of your application depend on it, but...
A heuristic would be to fill your buffer with some 'poison' pattern before decompressing into it. Then, after decompression, scan the buffer for the first occurrence of the poison pattern.
This is only a heuristic because it's perfectly conceivable that the decompressed data could just happen to contain your poison pattern, unless you have exact domain knowledge of what the data will be and can choose a pattern that you know cannot occur in it.
Even then, it is an imperfect solution at best.
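A minimal sketch of the poison idea, under loud assumptions: `decompress_fn` is a hypothetical stand-in for the real API (which reports no output size), `toy_rle_decompress` is a toy run-length decoder used only so the example can run, and `POISON_BYTE` and `guess_size_with_poison` are names made up here. The sketch uses a single poison byte and scans backward from the end of the buffer rather than forward for the first occurrence, so poison-valued bytes in the middle of the data do not cut the result short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature standing in for the real API: it writes into dst
   and tells the caller nothing about how many bytes it produced. */
typedef void (*decompress_fn)(const unsigned char *src, size_t src_len,
                              unsigned char *dst);

/* Toy run-length decoder, for demonstration only.
   Input is a sequence of (count, value) byte pairs. */
void toy_rle_decompress(const unsigned char *src, size_t src_len,
                        unsigned char *dst)
{
    for (size_t i = 0; i + 1 < src_len; i += 2) {
        memset(dst, src[i + 1], src[i]);
        dst += src[i];
    }
}

#define POISON_BYTE 0xA5 /* arbitrary choice; pick a value unlikely in your data */

/* Fills buf with poison, decompresses into it, then returns the index of the
   last non-poison byte plus one. This is a guess: it under-reports whenever
   the real data ends in one or more bytes equal to POISON_BYTE. */
size_t guess_size_with_poison(decompress_fn fn,
                              const unsigned char *src, size_t src_len,
                              unsigned char *buf, size_t cap)
{
    assert(buf != NULL);
    memset(buf, POISON_BYTE, cap);
    fn(src, src_len, buf);

    size_t n = cap;
    while (n > 0 && buf[n - 1] == POISON_BYTE)
        n--;
    return n;
}
```

With a guess in hand, the buffer could be shrunk with `realloc(buf, n)`. Note that nothing here protects against the decompressor writing past `cap`; the oversized buffer still has to be large enough.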
Usually this information is supplied at compression time (take a look at 7-Zip's LZMA SDK, for example).
With the information you have given so far, there is no way to know the actual size of the decompressed data (or the size of the part of the buffer that is actually in use).
If the decompression step doesn't give you the decompressed size as a return value or "out" parameter in some way, you can't.
There is no way to determine how much data was written in the buffer (outside of debugger/valgrind-type checks).
A more involved way to solve this problem is to decompress twice into an oversized buffer.
In both passes, you need a "random pattern". Starting from the end of the buffer, you count the bytes that still match the pattern; the end of the decompressed sequence is where the buffer first differs from it.
Or is it? By chance, one of the final bytes of the decompressed sequence may equal the random byte at that exact position, so the real decompressed size might be larger than the detected one. If your pattern is truly random, the difference should not be more than a few bytes.
So fill the buffer again with a different random pattern, making sure that at each position the new pattern has a different value from the old one. For speed, you need not refill the whole buffer: you can limit the new pattern to a few bytes before and some more bytes after the first detected end. 32 bytes should be enough, since it is improbable that so many bytes matched the first random pattern by chance.
Decompress a second time and again detect where the buffer differs from the pattern. Take the larger of the two detected ends. That is your decompressed size.
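The two-pass idea above can be sketched as follows. Everything named here is an assumption for illustration: `decompress_fn` is a hypothetical stand-in for the real API, `toy_rle_decompress` is a toy decoder so the example can run, and `pattern_a` is a cheap position-dependent filler rather than a truly random one. The sketch refills the whole buffer on the second pass instead of only a small window, and it assumes the decompressor is deterministic and writes only its output bytes, contiguously from the start of the buffer. Because pattern B is the bitwise complement of pattern A, the last data byte cannot match both, so under those assumptions the larger of the two detected ends is the real size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature standing in for the real API (no size reported). */
typedef void (*decompress_fn)(const unsigned char *src, size_t src_len,
                              unsigned char *dst);

/* Toy run-length decoder, for demonstration only: (count, value) pairs. */
void toy_rle_decompress(const unsigned char *src, size_t src_len,
                        unsigned char *dst)
{
    for (size_t i = 0; i + 1 < src_len; i += 2) {
        memset(dst, src[i + 1], src[i]);
        dst += src[i];
    }
}

/* Pattern A: illustrative position-dependent filler (not truly random). */
unsigned char pattern_a(size_t i) { return (unsigned char)(i * 167u + 13u); }

/* Pattern B: complement of A, so it differs from A at every position. */
unsigned char pattern_b(size_t i) { return (unsigned char)~pattern_a(i); }

/* Fills buf with the given pattern, decompresses, and scans back from the
   end for the last byte that no longer matches the pattern. */
size_t end_after_fill(decompress_fn fn,
                      const unsigned char *src, size_t src_len,
                      unsigned char *buf, size_t cap,
                      unsigned char (*pat)(size_t))
{
    assert(buf != NULL);
    for (size_t i = 0; i < cap; i++)
        buf[i] = pat(i);
    fn(src, src_len, buf);

    size_t n = cap;
    while (n > 0 && buf[n - 1] == pat(n - 1))
        n--;
    return n;
}

/* Runs both passes and returns the larger detected end. */
size_t size_by_two_passes(decompress_fn fn,
                          const unsigned char *src, size_t src_len,
                          unsigned char *buf, size_t cap)
{
    size_t a = end_after_fill(fn, src, src_len, buf, cap, pattern_a);
    size_t b = end_after_fill(fn, src, src_len, buf, cap, pattern_b);
    return a > b ? a : b;
}
```

The cost is a second full decompression, and the result is only as trustworthy as the assumptions: a decompressor that uses the tail of the buffer as scratch space would defeat it.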
You should check how free works for your compiler/OS
and do the same.
free doesn't take the size of the malloc'ed data, but it somehow knows how much to free, right? ;)
Usually the size is stored just before the allocated buffer, though I don't know exactly how many bytes before; that again depends on the OS/architecture/compiler.
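One caveat is worth illustrating: what the allocator records is the size of the allocation, not the number of bytes later written into it. The sketch below assumes glibc, where `malloc_usable_size` from `<malloc.h>` exposes that bookkeeping (it is a non-standard extension and may be missing elsewhere); the helper name `usable_after_partial_write` is made up for this example.

```c
#include <assert.h>
#include <malloc.h> /* malloc_usable_size: glibc extension, not standard C */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Allocates `requested` bytes, writes only `written` of them, and returns
   what the allocator reports as the usable size of the block. */
size_t usable_after_partial_write(size_t requested, size_t written)
{
    assert(written <= requested);
    unsigned char *p = malloc(requested);
    assert(p != NULL);

    memset(p, 0x42, written); /* touch only part of the block */
    size_t usable = malloc_usable_size(p);

    free(p);
    return usable;
}
```

However few bytes are written, the reported size is at least the requested size, so this route tells you how big the oversized buffer is, not how much of it the decompressor filled.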