How can I quantify the differences between two large files of the same size?
Usually, to find out how two binary files differ, I use diff and hexdump. But in some situations, given two large binary files of the same size, I would like to see only their quantitative differences, such as the number of differing regions and the cumulative difference.
Example: two files, A and B. They have 2 differing regions, and their cumulative difference is 6c-a3 + 6c-11 + 6f-6e + 20-22.
File A = 48 65 6c 6c 6f 2c 20 57
File B = 48 65 a3 11 6e 2c 22 57
              |--------|  |--|
                reg 1     reg 2
How can I get such information using standard GNU tools and Bash, or would I be better off with a simple Python script? Other statistics about how the two files differ could also be useful, but I don't know what else can be measured, or how. Entropy difference? Variance difference?
Comments (2)
For everything but the regions thing you can use numpy. Something like this (untested):
I couldn't find a numpy function for computing the regions, but just write your own using
a != b
as input; it shouldn't be hard. See this question for inspiration.
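A minimal sketch of this idea, not the answerer's original snippet. It assumes both files fit in memory and treats a "region" as a maximal run of consecutive differing bytes. The name `diff_stats` and the choice to report both a signed and an absolute sum are illustrative assumptions, since the question does not say which it wants.

```python
import numpy as np

def diff_stats(a_bytes: bytes, b_bytes: bytes) -> dict:
    """Summarize byte-wise differences between two equal-length buffers."""
    if len(a_bytes) != len(b_bytes):
        raise ValueError("inputs must have the same length")
    # Widen to a signed type so the subtraction cannot wrap around.
    a = np.frombuffer(a_bytes, dtype=np.uint8).astype(np.int64)
    b = np.frombuffer(b_bytes, dtype=np.uint8).astype(np.int64)
    mask = a != b          # True where the bytes differ
    delta = a - b          # signed per-byte difference
    # A region starts at a differing byte whose predecessor does not differ.
    previous = np.concatenate(([False], mask[:-1]))
    starts = mask & ~previous
    return {
        "differing_bytes": int(mask.sum()),
        "regions": int(starts.sum()),
        "signed_sum": int(delta.sum()),
        "absolute_sum": int(np.abs(delta).sum()),
    }

# The example from the question.
a = bytes.fromhex("48656c6c6f2c2057")
b = bytes.fromhex("4865a3116e2c2257")
print(diff_stats(a, b))
```

For real files, `np.fromfile(path, dtype=np.uint8)` could replace `np.frombuffer`. The widening to `int64` multiplies memory use by eight, so very large files would probably need to be processed in chunks.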
One approach that springs to mind is to hack a bit on a binary diffing algorithm, e.g. a Python implementation of the rsync algorithm. Starting from that, it should be relatively easy to get a list of block ranges where the files differ, and then compute whatever statistics you want on those blocks.
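As a rough illustration of the "list of block ranges" idea, here is a sketch that is deliberately simpler than rsync: it compares the two files block by block at the same offsets, with no rolling checksum. That shortcut assumes the files are the same size and aligned, as in the question. The function name `differing_block_ranges` and the default `block_size` are hypothetical choices.

```python
import io

def differing_block_ranges(fa, fb, block_size=4096):
    """Yield (start, end) byte offsets of runs of consecutive differing blocks."""
    offset = 0
    run_start = None  # offset where the current run of differing blocks began
    while True:
        block_a = fa.read(block_size)
        block_b = fb.read(block_size)
        if not block_a and not block_b:
            break  # both streams exhausted
        if block_a != block_b:
            if run_start is None:
                run_start = offset
        elif run_start is not None:
            # An equal block closes the current run.
            yield (run_start, offset)
            run_start = None
        offset += max(len(block_a), len(block_b))
    if run_start is not None:
        yield (run_start, offset)

# Small in-memory demonstration; real use would pass files opened with "rb".
data_a = bytes(10)
data_b = bytearray(10)
data_b[2] = data_b[3] = data_b[9] = 1
ranges = list(differing_block_ranges(io.BytesIO(data_a),
                                     io.BytesIO(bytes(data_b)),
                                     block_size=2))
print(ranges)
```

The ranges are only as precise as the block size, so per-byte statistics would still have to be computed inside each reported range.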