How can I reverse-engineer a binary Thrift file?
I've been asked to process some files serialized as binary (not text/JSON unfortunately) Thrift objects, but I don't have access to the program or programmer that created the files, so I have no idea of their structure, field order, etc. Is there a way using the Thrift libraries to open a binary file and analyze it, getting a list of the field types, values, nesting, etc.?
Comments (1)
Unfortunately, it appears that Thrift's binary protocol does very little tagging of data. Decoding seems to assume you have the .thrift file in hand, so that you know, say, that the next 4 bytes are supposed to be an integer and aren't actually the first half of a double. So it appears you are basically stuck looking at the files in a hex editor (or equivalent) and trying to deduce fields from the patterns you see.
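To make the ambiguity concrete, here is a minimal sketch using only Python's standard struct module. The helper name interpretations is made up for this illustration; it simply reads the same 8 bytes as two 32-bit integers, one 64-bit integer, or one double, all big-endian.

```python
import struct


def interpretations(chunk: bytes) -> dict:
    """Return several plausible big-endian readings of the same 8 bytes."""
    if len(chunk) != 8:
        raise ValueError("expected exactly 8 bytes")
    return {
        "two_i32": struct.unpack(">ii", chunk),
        "one_i64": struct.unpack(">q", chunk)[0],
        "one_double": struct.unpack(">d", chunk)[0],
    }


# The bytes 3F F8 00 00 00 00 00 00 are a perfectly good double (1.5),
# but they also parse as two valid 32-bit integers.
sample = struct.pack(">d", 1.5)
readings = interpretations(sample)
for name, value in readings.items():
    print(f"{name:>10}: {value}")
```

Nothing in the bytes themselves says which reading is right, which is why the surrounding structure (lengths, counts, type codes) is what you end up leaning on.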
There are a few helpful bits:
Each file begins with a version, a protocol identifier string, and a sequence number. Maps begin with 6 bytes: the key type and the value type (one byte each, as integer codes) followed by the number of elements as a 4-byte integer. The type codes appear to be standard (the canonical location of their definitions seems to be TProtocol.h in the Thrift sources; for instance, a boolean value is specified by type code 2, a UTF-8 string by type code 16, and so on). Strings are prefixed by a 4-byte integer length field, and lists are prefixed by the element type (1 byte) and a 4-byte length. It looks like all integer fields are saved big-endian, and floating-point values are saved in IEEE format (which should make doubles relatively easy to find, at least).
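The prefixes described above can be sketched with Python's struct module. This is only an illustration under stated assumptions: the function names are hypothetical, the type-code table beyond 2 and 16 is my own reading of TProtocol.h and should be checked against the sources, and the sample bytes are hand-built rather than taken from a real file.

```python
import struct

# Type codes: 2 (bool) and 16 (UTF-8 string) are the ones named above;
# the others are an assumption based on TProtocol.h and should be verified.
TYPE_NAMES = {
    2: "bool", 3: "byte", 4: "double", 6: "i16", 8: "i32", 10: "i64",
    11: "string", 12: "struct", 13: "map", 14: "set", 15: "list", 16: "utf8",
}


def read_i32(data, pos):
    """Read a big-endian signed 32-bit integer; return (value, next_pos)."""
    (value,) = struct.unpack_from(">i", data, pos)
    return value, pos + 4


def read_string(data, pos):
    """Read a 4-byte length followed by that many raw bytes."""
    length, pos = read_i32(data, pos)
    if length < 0 or pos + length > len(data):
        raise ValueError(f"implausible string length {length}")
    return data[pos:pos + length], pos + length


def read_list_header(data, pos):
    """Read a 1-byte element type and a 4-byte element count."""
    elem_type = data[pos]
    count, pos = read_i32(data, pos + 1)
    return (TYPE_NAMES.get(elem_type, elem_type), count), pos


def read_map_header(data, pos):
    """Read 1-byte key type, 1-byte value type, and a 4-byte entry count."""
    key_type, value_type = data[pos], data[pos + 1]
    count, pos = read_i32(data, pos + 2)
    names = (TYPE_NAMES.get(key_type, key_type),
             TYPE_NAMES.get(value_type, value_type))
    return names + (count,), pos


def find_strings(data, min_len=3):
    """Heuristic scan: offsets where a 4-byte length is followed by printable UTF-8."""
    hits = []
    for offset in range(len(data) - 4):
        (length,) = struct.unpack_from(">i", data, offset)
        end = offset + 4 + length
        if length < min_len or end > len(data):
            continue
        try:
            text = data[offset + 4:end].decode("utf-8")
        except UnicodeDecodeError:
            continue
        if text.isprintable():
            hits.append((offset, text))
    return hits


# Hand-built sample: a list of two i32 values, then the string "hello".
sample = bytes([8]) + struct.pack(">iii", 2, 1, 2) + struct.pack(">i", 5) + b"hello"
header, pos = read_list_header(sample, 0)
first, pos = read_i32(sample, pos)
second, pos = read_i32(sample, pos)
text, pos = read_string(sample, pos)
string_hits = find_strings(sample)
map_header, _ = read_map_header(bytes([11, 8]) + struct.pack(">i", 3), 0)
```

The find_strings scan is the kind of pattern hunting you would otherwise do by eye in a hex editor: a plausible length immediately followed by readable text is a strong hint that a string field starts there, and the bytes just before it are worth a closer look.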
The TBinaryProtocol* files in the Thrift sources have a few more helpful details; on the plus side, there are implementations in a number of different languages, so you can read the one written in the language you are most comfortable with.
Sorry, I know this probably isn't that helpful, but it really does appear that this is all the information the Thrift binary format provides. Clearly the binary format was designed on the assumption that you would always know the exact protocol spec already, and the goal was to minimize wire space rather than to make it at all easy to decode blindly.