How to delete parts from a binary file in C++
I would like to delete parts of a binary file using C++. The file is about 5-10 MB.
What I would like to do:
- Search for an ANSI string "something"
- Once I have found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I want to actually remove those bytes, not fill them with NULL, so that the file becomes smaller.
- I would like to save the modified data to a new binary file that is identical to the original, except for the n bytes I have deleted.
Can you give me some advice / best practices on how to do this most efficiently? Should I load the file into memory first?
How can I search efficiently for an ANSI string? I may have to skip a few megabytes of data before I find that string. (I was told to ask this as a separate question, so it is here: How to look for an ANSI string in a binary file?)
How can I delete n bytes and write it out to a new file efficiently?
OK, I don't need it to be super efficient: the file will not be bigger than 10 MB, and it's OK if it runs for a few seconds.
Comments (3)
There are a number of fast string search routines that perform much better than testing each and every character. For example, when searching for "something" (9 characters), a skip-based algorithm such as Boyer-Moore needs to test only every 9th character in the best case.
Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str
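To make the skipping idea concrete, here is a minimal sketch of one such routine, Boyer-Moore-Horspool. It is not the code from the linked answer; the function name horspool_find is made up for illustration, and the file contents are assumed to be already loaded into a std::string (which can hold embedded zero bytes).

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: returns the offset of the first occurrence of
// `needle` in `data`, or std::string::npos if it does not occur.
std::size_t horspool_find(const std::string& data, const std::string& needle) {
    const std::size_t n = data.size();
    const std::size_t m = needle.size();
    if (m == 0) return 0;
    if (m > n) return std::string::npos;

    // shift[c] = how far the window may advance when its last byte is c.
    // Bytes that do not occur in the needle allow a jump of the full length m.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(needle[i])] = m - 1 - i;

    const unsigned char needle_last = static_cast<unsigned char>(needle[m - 1]);
    std::size_t pos = 0;
    while (pos <= n - m) {
        // Look at the LAST byte of the current window first.
        const unsigned char last = static_cast<unsigned char>(data[pos + m - 1]);
        if (last == needle_last &&
            std::memcmp(data.data() + pos, needle.data(), m) == 0)
            return pos;
        pos += shift[last];
    }
    return std::string::npos;
}
```

Calling horspool_find(buffer, "something") yields the offset at which the match starts, so the bytes to delete would begin at that offset plus 9.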
For a 5-10 MB file I would have a look at writev(), if your system supports it. Read the entire file into memory, since it is small enough. Scan for the bytes you want to drop. Pass writev() the list of iovecs (which will just be pointers into your read buffer, plus lengths), and you can then write out the entire modified contents in a single system call.
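A minimal sketch of that approach, assuming a POSIX system and that the offset and length of the range to drop are already known. The names read_file and write_without_range are hypothetical, and for simplicity a short write is treated as a failure rather than retried.

```cpp
#include <sys/types.h>
#include <sys/uio.h>   // writev, struct iovec (POSIX)
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads the whole file into memory.
std::string read_file(const char* path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: writes buf to out_path, leaving out the bytes in
// [cut, cut + n). Returns true only if everything was written.
bool write_without_range(const std::string& buf, std::size_t cut, std::size_t n,
                         const char* out_path) {
    if (cut > buf.size()) cut = buf.size();
    if (n > buf.size() - cut) n = buf.size() - cut;

    // Two iovecs: the part before the cut and the part after it.
    // Both point straight into the read buffer; nothing is copied.
    iovec parts[2];
    parts[0].iov_base = const_cast<char*>(buf.data());
    parts[0].iov_len  = cut;
    parts[1].iov_base = const_cast<char*>(buf.data() + cut + n);
    parts[1].iov_len  = buf.size() - cut - n;

    const int fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    const ssize_t written = writev(fd, parts, 2);
    const bool ok = written >= 0 &&
                    static_cast<std::size_t>(written) == buf.size() - n;
    return close(fd) == 0 && ok;
}
```

With the match found at offset p, the call would look like write_without_range(buf, p + 9, n, "out.bin") to keep the string itself and drop the n bytes after it.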
First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.
As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.
I am going to assume you will load into memory or memory map it for the following.
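As a rough sketch of the "load into memory" option with the size check mentioned above: the function name load_file and the max_bytes limit are assumptions made for illustration, not part of any library.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads the whole file into a std::string (which can
// hold embedded zero bytes). Files larger than max_bytes are refused up
// front instead of exhausting memory halfway through.
std::string load_file(const char* path, std::streamoff max_bytes) {
    // std::ios::ate opens the stream positioned at the end, so tellg()
    // reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) throw std::runtime_error("cannot open file");

    const std::streamoff size = in.tellg();
    if (size < 0 || size > max_bytes)
        throw std::runtime_error("file too large or size unknown");

    std::string data(static_cast<std::size_t>(size), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&data[0], size);
    if (in.gcount() != size) throw std::runtime_error("short read");
    return data;
}
```

For the 5-10 MB files in the question, a limit such as 64 * 1024 * 1024 would leave plenty of headroom.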
You did specifically say this is a binary file, which means you cannot use the C-style string functions (strstr and friends): null bytes in the file's data will confuse them and end the search prematurely without a match. (Length-aware facilities such as std::search do not have this limitation.) You can instead use memchr to find the first occurrence of the first byte of your target, and memcmp to compare the following bytes with the rest of the target; keep applying memchr/memcmp pairs to scan through the whole buffer until the target is found. This is not the most efficient way, as there are better pattern-matching algorithms, but it is reasonably efficient.
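The memchr/memcmp loop described above could look roughly like this. The function name find_bytes is made up for the example; the buffer is assumed to be in memory already.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns a pointer to the first occurrence of the
// needle_len bytes at `needle` inside the hay_len bytes at `hay`,
// or nullptr if there is none. Zero bytes are treated like any other byte.
const char* find_bytes(const char* hay, std::size_t hay_len,
                       const char* needle, std::size_t needle_len) {
    if (needle_len == 0) return hay;
    const char* const end = hay + hay_len;
    const char* p = hay;

    while (static_cast<std::size_t>(end - p) >= needle_len) {
        // Only start positions where the whole needle still fits are candidates.
        const std::size_t span = static_cast<std::size_t>(end - p) - needle_len + 1;

        // Jump to the next occurrence of the needle's first byte.
        p = static_cast<const char*>(std::memchr(p, needle[0], span));
        if (p == nullptr) return nullptr;

        // Candidate found: compare the full needle.
        if (std::memcmp(p, needle, needle_len) == 0) return p;
        ++p;  // false alarm, continue just after it
    }
    return nullptr;
}
```

The offset of the match is the returned pointer minus the start of the buffer; the deletion would begin needle_len bytes after that.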
To "delete" n bytes from an in-memory buffer you have to actually move the data that follows those n bytes, copying all of it down by n so that it starts where the deleted range began.
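For an in-memory buffer, that move is a single memmove. A small sketch, with remove_bytes as a made-up name:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: removes n bytes starting at offset pos from a buffer
// of len bytes by sliding the tail down. Returns the new logical length.
std::size_t remove_bytes(char* buf, std::size_t len, std::size_t pos, std::size_t n) {
    if (pos >= len) return len;          // nothing to remove
    if (n > len - pos) n = len - pos;    // clamp to the end of the buffer

    // memmove (not memcpy) because source and destination overlap.
    std::memmove(buf + pos, buf + pos + n, len - pos - n);
    return len - n;
}
```

If the data lives in a std::vector<char>, erase(begin() + pos, begin() + pos + n) performs the same shift and also adjusts the size.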
If you have already read the data from disk into memory, it is faster to manipulate it there and write the result to the new file. Otherwise, once you have found the offset X in the first file at which the deletion should start, open a new file for writing, read the first X bytes from the first file and write them straight into the second file, then seek to X+n in the first file and copy from there to file1's EOF, appending to what you have already written to file2.
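The file-to-file variant can be sketched as follows, assuming the offset X and the count n are already known. The name copy_skipping and the 64 KB chunk size are arbitrary choices for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>

// Hypothetical helper: copies src to dst, leaving out the n bytes that
// start at offset x. Returns false if src is shorter than x or on I/O errors.
bool copy_skipping(const char* src, const char* dst,
                   std::streamoff x, std::streamoff n) {
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!in || !out) return false;

    const std::streamsize kChunk = 64 * 1024;
    char chunk[64 * 1024];

    // 1. Copy the first x bytes unchanged.
    std::streamoff left = x;
    while (left > 0) {
        const std::streamsize want =
            static_cast<std::streamsize>(std::min<std::streamoff>(left, kChunk));
        in.read(chunk, want);
        const std::streamsize got = in.gcount();
        if (got <= 0) return false;  // source ended before offset x
        out.write(chunk, got);
        left -= got;
    }

    // 2. Skip over the n bytes to delete.
    in.seekg(x + n, std::ios::beg);

    // 3. Copy everything from x + n up to the end of the source file.
    while (in.read(chunk, kChunk) || in.gcount() > 0)
        out.write(chunk, in.gcount());

    out.flush();
    return static_cast<bool>(out);
}
```

This never holds more than one chunk in memory, which matters little for 10 MB but avoids the problem with very large files.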