What information describes the quantitative differences between two given large files of the same size?

Posted on 2024-12-11 11:55:31

Usually, to find out how two binary files differ, I use the diff and hexdump tools. But in some situations, given two large binary files of the same size, I would like to see only their quantitative differences, such as the number of differing regions and the cumulative difference.

Example: two files, A and B. They have 2 differing regions, and their cumulative difference is
(6c-a3) + (6c-11) + (6f-6e) + (20-22).

File A = 48 65 6c 6c 6f 2c 20 57
File B = 48 65 a3 11 6e 2c 22 57
              |--------|  |--|
                 reg 1   reg 2

How can I get such information using standard GNU tools and Bash, or would a simple Python script be the better choice? Other statistics about how the two files differ could also be useful, but I don't know what else can be measured, or how. Entropy difference? Variance difference?

Comments (2)

青衫负雪 2024-12-18 11:55:31

For everything except the regions, you can use NumPy. Something like this (untested):

import numpy as np
a = np.fromfile("file A", dtype="uint8")
b = np.fromfile("file B", dtype="uint8")

# Compute the number of bytes that are different
different_bytes = np.sum(a != b)

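# Widen to a signed type first: uint8 subtraction would wrap around modulo 256
delta = a.astype(np.int16) - b.astype(np.int16)

# Compute the sum of the differences
difference = np.sum(delta, dtype=np.int64)

# Compute the sum of the absolute value of the differences
absolute_difference = np.sum(np.abs(delta), dtype=np.int64)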

# In some cases, the number of bits that have changed is a better
# measurement of change. To compute it we make a lookup array where 
# bitcount_lookup[byte] == number_of_1_bits_in_byte (so
# bitcount_lookup[0:16] == [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4])
bitcount_lookup = np.array(
    [bin(i).count("1") for i in range(256)], dtype="uint8")

# Numpy allows using an array as an index. ^ computes the XOR of
# each pair of bytes. The result is a byte with a 1 bit where the
# bits of the input differed, and a 0 bit otherwise.
bit_diff_count = np.sum(bitcount_lookup[a ^ b])

I couldn't find a NumPy function for computing the regions, but writing your own with a != b as input shouldn't be hard. See this question for inspiration.
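One possible way to count the regions from the a != b mask is sketched below. The function name count_diff_regions is hypothetical, and the arrays are the 8 bytes from the question's example rather than real files. It pads the mask with False on both sides and looks for the rising and falling edges.

```python
import numpy as np

def count_diff_regions(a, b):
    """Return (count, starts, ends) of maximal runs where a != b.

    starts are inclusive offsets, ends are exclusive offsets.
    Hypothetical helper, not part of NumPy.
    """
    mask = a != b
    # Pad with False so that runs touching either end still have both edges
    padded = np.concatenate(([False], mask, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # False -> True transitions
    ends = np.flatnonzero(edges == -1)    # True -> False transitions
    return len(starts), starts, ends

# The two byte sequences from the question
a = np.array([0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x2c, 0x20, 0x57], dtype=np.uint8)
b = np.array([0x48, 0x65, 0xa3, 0x11, 0x6e, 0x2c, 0x22, 0x57], dtype=np.uint8)

count, starts, ends = count_diff_regions(a, b)
print(count, list(zip(starts.tolist(), ends.tolist())))
# → 2 [(2, 5), (6, 7)]
```

The length of each region is ends - starts, so per-region statistics can be computed by slicing a and b with these offsets.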

时光暖心i 2024-12-18 11:55:31

One approach that springs to mind is to hack on a binary diffing algorithm, e.g. a Python implementation of the rsync algorithm. Starting from that, you should fairly easily get a list of block ranges where the files differ, and can then compute whatever statistics you want on those blocks.
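As a rough illustration of the "list of block ranges" idea, here is a minimal sketch that is deliberately not the rsync algorithm: because the two files have the same size and are compared at the same offsets, it assumes a plain fixed-size block comparison is enough. The function name differing_block_ranges, the block size, and the temporary demo files are all made up for the example.

```python
import os
import tempfile

def differing_block_ranges(path_a, path_b, block_size=4096):
    """Yield (start, end) byte offsets of runs of consecutive differing blocks.

    end is exclusive. Hypothetical helper; reads both files block by block,
    so it never holds a whole file in memory.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        run_start = None
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                if run_start is None:
                    run_start = offset      # a new differing run begins here
            elif run_start is not None:
                yield (run_start, offset)   # the run ended at this block
                run_start = None
            offset += max(len(block_a), len(block_b))
        if run_start is not None:
            yield (run_start, offset)       # run reaches the end of the files

# Demo on two small temporary files that differ at offsets 5 and 13
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    data_a = bytes(16)
    data_b = bytearray(data_a)
    data_b[5] = 1
    data_b[13] = 1
    with open(path_a, "wb") as f:
        f.write(data_a)
    with open(path_b, "wb") as f:
        f.write(bytes(data_b))
    ranges = list(differing_block_ranges(path_a, path_b, block_size=4))

print(ranges)
# → [(4, 8), (12, 16)]
```

The reported ranges are only block-accurate; for exact byte regions, each differing block could be inspected further with a byte-level comparison.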
