Finding patterns in a binary file
I'm working on a small project in C where I have to parse a binary file in an undocumented format. As I'm quite new to C, I have two questions for more experienced programmers.
The first seems to be an easy one. How do I extract all the strings from the binary file and put them into an array? Basically, I am looking for a simple implementation of the strings program in C.
When I open the binary file in any text editor I get a lot of rubbish with some readable strings mixed in. I can extract these strings using strings on the command line. Now I'd like to do something similar in C, like in the pseudocode below:
while (!EOF) {
    if (string found) {
        put it into array[i]
        i++
    }
}
return i;
The second problem is a little more complicated and is, I believe, the proper way of achieving the same thing. When I look at the file in a hex editor, it's easy to notice some patterns. For example, before each string there is a byte of value 02 (0x02), followed by the length of the string and the string itself. For example, in 02 18 52 4F 4F 54 4B 69 57 69 4B 61 4B 69 the bytes after 02 18 (52 4F 4F 54 4B 69 57 69 4B 61 4B 69) are the string.
Now the function I'm trying to create would work like this:
while (!EOF) {
    for (i = 0; i < buffer_size; ++i) {
        if (buffer[i] == 0x02) {
            int n = read the next byte;
            string = read the next n bytes as char;
            put string into array;
        }
    }
}
Thanks for any pointers. :)
Comments (2)
Figure out which byte range represents printable ASCII characters. Iterate over the file, checking whether each byte is printable and counting runs of adjacent printable characters. By default, strings treats sequences of four or more characters as strings; when you reach the next non-printable byte, check whether the run reached that minimum length and, if it did, output the string. Some bookkeeping is necessary.
Your pseudocode is essentially correct. You can compare the contents of buffer[i] directly with an integer (e.g. 2). Reading the next byte is as simple as incrementing i. Make sure you don't overrun the buffer, and make sure the array you're reading the string into is big enough (if the size field is only one byte, you can get away with a 255-byte buffer, plus one more byte if you NUL-terminate the string).
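The run-counting approach described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the file has already been read into memory (for example with fread), "printable" means bytes 0x20 through 0x7E, and the names extract_strings and is_printable are made up for illustration rather than taken from the real strings program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: printable ASCII is 0x20 (space) through 0x7E (~). */
static int is_printable(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/*
 * Hypothetical helper: scans data[0..len) and stores a malloc'ed,
 * NUL-terminated copy of every run of at least min_len printable
 * bytes in out[]. Stops once max_out strings have been stored.
 * Returns the number of strings stored; the caller frees each one.
 */
size_t extract_strings(const unsigned char *data, size_t len,
                       size_t min_len, char **out, size_t max_out)
{
    size_t count = 0;
    size_t run_start = 0;   /* index where the current run began */
    size_t i;

    assert(data != NULL && out != NULL && min_len > 0);

    /* i == len acts as a sentinel so a run ending at EOF is flushed. */
    for (i = 0; i <= len && count < max_out; ++i) {
        if (i < len && is_printable(data[i]))
            continue;                   /* still inside a run */

        size_t run_len = i - run_start;
        if (run_len >= min_len) {
            char *s = malloc(run_len + 1);
            if (s == NULL)
                break;                  /* out of memory: stop early */
            memcpy(s, data + run_start, run_len);
            s[run_len] = '\0';
            out[count++] = s;
        }
        run_start = i + 1;              /* next run starts after this byte */
    }
    return count;
}
```

Calling extract_strings(data, len, 4, out, max_out) mimics the default minimum length of four mentioned above. A fixed-size out array keeps the sketch short; a real tool would probably grow the array with realloc instead.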
I'm not sure your solution will work: what if you find a string that is 350 characters long?
Can numbers be part of a string, or do you consider them "rubbish"?
I think the safest way is
I know, it's boring, but I think it's the only safe way. Good luck!
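On the length question: if the layout really is a 0x02 tag followed by a single length byte, a record cannot announce more than 255 bytes, so the remaining risk is a stray 0x02 whose "length" runs past the end of the buffer. A rough sketch of a bounds-checked reader follows. The name next_tagged_string is hypothetical, and the code assumes the length byte counts the payload bytes that follow it. In the sample from the question the length byte is 0x18 (24) while only 12 bytes are shown, so the real format may count something else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAG_STRING 0x02   /* marker byte observed before each string */
#define MAX_FIELD  255    /* largest value a one-byte length can hold */

/*
 * Hypothetical helper: finds the next 0x02-tagged record at or after
 * *pos, copies its payload into dst (at least MAX_FIELD + 1 bytes) and
 * NUL-terminates it. Advances *pos past the record. Returns the payload
 * length, or -1 when no complete record remains.
 */
int next_tagged_string(const unsigned char *buf, size_t len,
                       size_t *pos, char *dst)
{
    assert(buf != NULL && pos != NULL && dst != NULL);

    while (*pos + 1 < len) {            /* need room for tag + length */
        if (buf[*pos] != TAG_STRING) {
            ++*pos;
            continue;
        }
        size_t n = buf[*pos + 1];       /* always in 0..255 */
        if (n > len - (*pos + 2)) {     /* payload would overrun buf */
            ++*pos;                     /* treat it as a stray 0x02 */
            continue;
        }
        memcpy(dst, buf + *pos + 2, n);
        dst[n] = '\0';
        *pos += 2 + n;                  /* skip tag, length and payload */
        return (int)n;
    }
    return -1;
}
```

Calling it in a loop until it returns -1 walks every record in the buffer. Because 0x02 can also occur inside unrelated binary data, the results are candidates rather than guaranteed strings; running a printable-character check over each payload is one way to filter them.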