Reverse engineering a "UTF-8 Like" encoding algorithm

Posted on 2025-01-09 16:24:34

I'm attempting to reverse engineer an encoding algorithm to ensure backwards compatibility with other software packages. For each type of quantity to be encoded in the output file, there is a separate encoding procedure.

The given documentation only shows the end user how to parse values from the encoded file, not how to write anything back to it. However, I have managed to write a counterpart for every documented reader of every file type (for example, a write_int() for read_int()), with the exception of the read_string() below.

I am currently (and have been for a while) struggling to wrap my head around exactly what is going on in the read_string() function listed below.

I fully understand that this is a masking problem, and that the first operation, while partial_length & 0x80 > 0:, is a simple bitwise mask that means we only enter the loop when the byte just read is 128 or larger (that is, when its high bit is set). I begin to lose my head, however, when trying to extract meaning from the body of that while loop. I get the mathematical machinery behind the operations, but I can't see why they would be doing things in this way.

I have included the read_byte() function for context, as it is called in the read_string() function.

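import struct

def read_byte(handle):
    # Read one byte and return it as an unsigned integer (0-255).
    return struct.unpack("<B", handle.read(1))[0]

def read_string(handle):
    # Length prefix: 7 bits per byte, lowest-order group first.
    # A set high bit (0x80) means another length byte follows.
    total_length = 0
    partial_length = read_byte(handle)
    num_bytes = 0
    while partial_length & 0x80 > 0:
        total_length += (partial_length & 0x7F) << (7 * num_bytes)
        partial_length = read_byte(handle)
        num_bytes += 1
    total_length += partial_length << (7 * num_bytes)
    result = handle.read(total_length)
    # Compare byte counts before decoding: a decoded string has fewer
    # characters than bytes when it contains multi-byte UTF-8 sequences.
    if len(result) < total_length:
        raise Exception("Failed to read complete string")
    return result.decode("utf-8")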
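
To make the loop concrete, here is a minimal sketch, not taken from the documentation, that applies the same arithmetic to an in-memory byte sequence instead of a file handle. The helper name decode_length and the sample bytes are hypothetical and chosen only for illustration.

```python
def decode_length(data):
    """Apply the same arithmetic as the loop in read_string() to a bytes object.

    Returns (length, number_of_prefix_bytes_consumed).
    """
    total_length = 0
    num_bytes = 0
    for byte in data:
        # The low 7 bits carry payload; each later byte is worth 128x more.
        total_length += (byte & 0x7F) << (7 * num_bytes)
        num_bytes += 1
        if byte & 0x80 == 0:
            # High bit clear: this was the last length byte.
            return total_length, num_bytes
    raise ValueError("length prefix ended before a byte with the high bit clear")

print(decode_length(b"\x05"))      # one byte, high bit clear
print(decode_length(b"\xac\x02"))  # 0xAC has the high bit set, so 0x02 follows
```

In the second call, 0xAC contributes its low 7 bits (44) and 0x02 contributes 2 << 7 = 256, so the decoded length is 300, read from 2 prefix bytes.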

Is this indicative of an impossible task due to information loss, or am I missing an obvious way to perform the opposite of this read_string function?
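
One possible reading, offered as an assumption rather than something the documentation confirms: the length prefix resembles a base-128 variable-length integer, with 7 payload bits per byte, the lowest-order group first, and the high bit acting as a "more bytes follow" flag. Under that reading nothing is lost, and an inverse could be sketched as below. The name write_string is hypothetical, and whether the other software packages accept this exact byte layout would still need to be checked against real files.

```python
import io

def write_string(handle, text):
    """Hypothetical inverse of read_string(): length prefix, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    remaining = len(payload)  # the prefix counts bytes, not characters
    prefix = bytearray()
    while remaining >= 0x80:
        # Emit the low 7 bits with the high bit set to signal "more follows".
        prefix.append((remaining & 0x7F) | 0x80)
        remaining >>= 7
    prefix.append(remaining)  # final length byte, high bit clear
    handle.write(bytes(prefix) + payload)

buffer = io.BytesIO()
write_string(buffer, "a" * 300)
encoded = buffer.getvalue()
print(encoded[:2].hex())  # length prefix for a 300-byte payload
```

For a 300-byte payload the prefix comes out as the two bytes 0xAC 0x02, which is the same pair the read loop turns back into 300. Lengths below 128 fit in a single prefix byte.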

I would greatly appreciate any information, insights (however obvious you may think they are), help, or pointers, even if it is just a link to a page that you think might prove useful.

Cheers!
