Finding a pattern in a large binary file using C or C++?
I have a ~700 MB binary file (non-text data), and I would like to search it for a specific pattern of bytes that occurs at random locations throughout the file, e.g. 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55 and so on for 50 or so bytes in sequence. The pattern I'd be searching for is a repeating sequence of two arbitrary bytes followed by 0x55.
That is, search for tables stored in the file with 0x55 being the delimiter, and then save the data contained in the tables or otherwise manipulate it.
Would the best option be simply going through every individual byte one at a time, and then looking ahead two bytes to see if the value is 0x55, and if it is, then looking ahead again and again to confirm that a table exists in that location?
Load the whole thing? fseek? Buffer chunks, searching those one byte at a time?
What would be the best way of looking through this large file, and finding the pattern, using C or C++?
Comments (3)
This sounds like a great job for a regular expression matcher or a deterministic finite automaton. These are high-power tools designed to do just what you're asking, and if you have them at your disposal you shouldn't have much trouble doing this sort of search. In C++, consider looking into the Boost.Regex libraries, which should have all the functionality you need to knock this problem down.
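For this particular pattern the automaton can also be written by hand. The following is only a minimal sketch, assuming the data is already in memory and that a "table" means at least `min_groups` consecutive 3-byte groups ending in 0x55; `TableHit` and `find_tables` are made-up names for illustration, not part of Boost.Regex or any other library. It keeps one counter per alignment (position modulo 3), so the file is scanned once without backtracking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: one candidate table.
struct TableHit {
    std::size_t offset;  // index of the first data byte of the run
    std::size_t groups;  // number of consecutive 3-byte groups
};

// Reports every maximal run of at least `min_groups` groups of the form
// <any byte> <any byte> 0x55. Runs on different alignments are tracked
// independently, so overlapping candidates may be reported.
std::vector<TableHit> find_tables(const std::uint8_t* data, std::size_t size,
                                  std::size_t min_groups) {
    assert(data != nullptr || size == 0);
    constexpr std::uint8_t kDelimiter = 0x55;
    std::vector<TableHit> hits;
    std::size_t run[3] = {0, 0, 0};    // current run length per alignment
    std::size_t start[3] = {0, 0, 0};  // start offset of that run

    // Ends the run on one alignment and records it if it is long enough.
    auto flush = [&](std::size_t phase) {
        if (run[phase] > 0 && run[phase] >= min_groups) {
            hits.push_back({start[phase], run[phase]});
        }
        run[phase] = 0;
    };

    for (std::size_t i = 0; i < size; ++i) {
        const std::size_t phase = i % 3;
        // A delimiter needs two data bytes in front of it, hence i >= 2.
        if (i >= 2 && data[i] == kDelimiter) {
            if (run[phase] == 0) start[phase] = i - 2;
            ++run[phase];
        } else {
            flush(phase);
        }
    }
    for (std::size_t phase = 0; phase < 3; ++phase) flush(phase);
    return hits;
}
```

Because the only state is the two small arrays, the same loop could in principle be fed buffered chunks instead of one large buffer, as long as the counters and the absolute offset are carried over between chunks.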
What ultimately worked for me was a hybrid between the Boyer-Moore-Horspool algorithm (suggested by Jerry Coffin) and my own algorithm based on the structure of the tables and the data being stored.
Basically, the BMH algorithm caught most of the things I was looking for. The obvious stuff.
But some tables did turn out to have odd formatting, and I had to implement a semi-intelligent search that would look at the data following each 0x55 and figure out whether it was likely to be good data or just random junk. Oddly enough, I ended up implementing it in PHP rather than C++, and dumping the results right into a MySQL database for querying. The search process took around 5 minutes or less, and the results were largely good. I did end up with a lot of junk data, but it caught everything that I needed it to, and (as far as I'm aware) did not leave any good data behind.
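For readers who want to try the Boyer-Moore-Horspool part in C++, the standard library has shipped a searcher for it since C++17. This is only a sketch under assumptions: the answer does not say which byte sequence was actually searched for, so `needle` stands for some fixed signature (no wildcards), and `find_signature` is a hypothetical helper name. The follow-up validation of the data after each 0x55 is specific to the author's tables and is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Returns the offset of every (possibly overlapping) occurrence of `needle`
// in `haystack`, using the C++17 Boyer-Moore-Horspool searcher.
std::vector<std::size_t> find_signature(const std::vector<std::uint8_t>& haystack,
                                        const std::vector<std::uint8_t>& needle) {
    std::vector<std::size_t> offsets;
    if (needle.empty()) return offsets;  // an empty needle would match everywhere

    // The searcher precomputes its skip table once and is reused for each search.
    std::boyer_moore_horspool_searcher searcher(needle.begin(), needle.end());

    auto it = haystack.begin();
    while (true) {
        it = std::search(it, haystack.end(), searcher);
        if (it == haystack.end()) break;
        offsets.push_back(static_cast<std::size_t>(it - haystack.begin()));
        ++it;  // resume one byte later so overlapping matches are found too
    }
    return offsets;
}
```

Each reported offset would then be a starting point for whatever table-specific checks are needed.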
If you can load the whole thing into memory, you should probably use the memory mapping features provided by your platform. This way, the operating system can decide if it should keep large portions of the file in physical memory (i.e. the system has lots of free RAM at the moment), or if it should work only in smaller chunks.
Of course, this only works if you can fit the file into your working set.
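As a rough illustration of the memory-mapping approach, here is a minimal read-only sketch that assumes a POSIX system (`open`, `fstat`, `mmap`, `munmap`); on Windows the equivalent calls would be `CreateFileMapping` and `MapViewOfFile`. `MappedFile` is a hypothetical wrapper name, and the error handling is deliberately simple.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper: maps a whole file read-only and unmaps it on
// destruction. Empty or unreadable files leave the object in a !ok() state.
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) {
            std::perror("open");
            return;
        }
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const std::uint8_t*>(p);
                size_ = len;
            } else {
                std::perror("mmap");
            }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_ != nullptr) ::munmap(const_cast<std::uint8_t*>(data_), size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    bool ok() const { return data_ != nullptr; }
    const std::uint8_t* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const std::uint8_t* data_ = nullptr;
    std::size_t size_ = 0;
};
```

The pointer returned by `data()` can be scanned like an ordinary byte array; the operating system pages the file in on demand and may evict pages again under memory pressure.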