Reverse engineering a "UTF-8-like" encoding algorithm
I'm attempting to reverse engineer an encoding algorithm to ensure backwards compatibility with other software packages. For each type of quantity to be encoded in the output file, there is a separate encoding procedure.
The documentation I was given only shows the end user how to parse values from the encoded file, not how to write anything back to it. However, I have been able to create a corresponding write_int() for every documented read_int() in every file type; the one exception is the read_string() function below.
I am currently (and have been for a while) struggling to wrap my head around exactly what is going on in the read_string() function listed below.
I understand that this is a masking problem, and that the first operation, while partial_length & 0x80 > 0:, is a simple bitwise mask that only lets us enter the loop for byte values of 128 or more (that is, bytes with the high bit set). Where I begin to lose my head is in trying to assign or extract meaning from the body of that while loop. I get the mathematical machinery behind the operations, but I can't see why they would be doing things in this way.
I have included the read_byte() function for context, as it is called in the read_string() function.
import struct

def read_byte(handle):
    return struct.unpack("<B", handle.read(1))[0]

def read_string(handle):
    total_length = 0
    partial_length = read_byte(handle)
    num_bytes = 0
    while partial_length & 0x80 > 0:
        total_length += (partial_length & 0x7F) << (7 * num_bytes)
        partial_length = ord(struct.unpack("c", handle.read(1))[0])
        num_bytes += 1
    total_length += partial_length << (7 * num_bytes)
    result = handle.read(total_length)
    result = result.decode("utf-8")
    if len(result) < total_length:
        raise Exception("Failed to read complete string")
    else:
        return result
Is this indicative of an impossible task due to information loss, or am I missing an obvious way to perform the opposite of this read_string() function?
I would greatly appreciate any information, insights (however obvious you may think they may be), help, or pointers possible, even if it means just a link to a page that you think might prove useful.
Cheers!
Comments (1)
It's just reading a length, which then tells it how many bytes to read. (I don't understand the check at the end, but that's a different issue.)
To avoid a fixed-width length field, the length is split into seven-bit units, which are sent low-order unit first. Each seven-bit unit is sent in a single 8-bit byte with the high-order bit set, except the last unit, which is sent as is. The reader therefore knows it has reached the end of the length when it reads a byte whose high-order bit is 0 (in other words, a byte less than 0x80).
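Since no information is lost in that scheme (it is the same idea as an unsigned LEB128 or "varint" length prefix), the writer can be sketched by running the steps in reverse. The following is a minimal sketch, not the original package's code: the names encode_length() and write_string() are hypothetical, and it assumes the length counts UTF-8 bytes rather than characters, which is what handle.read(total_length) in read_string() implies.

```python
import io


def encode_length(n):
    """Encode a non-negative int as 7-bit groups, low-order group first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while n >= 0x80:
        # More groups follow, so emit the low 7 bits with the high bit set.
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)  # final group: high bit clear marks the end of the length
    return bytes(out)


def write_string(handle, text):
    """Hypothetical inverse of read_string(): length prefix, then UTF-8 bytes."""
    data = text.encode("utf-8")
    handle.write(encode_length(len(data)))  # assumed: length is in bytes
    handle.write(data)


buf = io.BytesIO()
write_string(buf, "x" * 300)
print(buf.getvalue()[:2].hex())  # 300 = 0b10_0101100 -> 'ac02'
```

Lengths below 128 fit in a single byte, so short strings cost one byte of overhead. One caveat about the reader as posted: its final check compares the decoded character count with the byte length, so it appears to raise for any string containing multi-byte UTF-8 characters, even when the data was written correctly; pure ASCII strings round-trip without trouble.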