How can old binary data files be interpreted without documentation?

Posted on 2024-08-12 17:02:11

Data is often stored in program-specific binary files for which there is little or no documentation. A typical example in our field is data that comes from an instrument, but I suspect the problem is general. What methods are there for trying to understand and interpret the data?

To set some boundaries. The files are not encrypted and there is no DRM. The type and format of the file is specific to the writer of the program (i.e. it is not a "standard file" - such as *.tar - whose identity has been lost). There is (probably) no deliberate obfuscation but there may be some amateur efforts to save space. We can assume that we have a general knowledge of what the data is and we may recognize some, but probably not all, of the fields and arrays.

Assume that the majority of the data is numeric, with scalars and arrays (probably 1- and 2-dimensional and sometimes irregular or triangular). There will also be some character strings, probably names of people, sites, dates and maybe some keywords. There will be code in the program that reads the binary file, but we do not have access to the source or the assembler. As an example, the file may have been written by a VAX Fortran program, by some early Unix program, or by Windows as OLE objects. The numbers may be big- or little-endian (which is not known at the start), but the byte order is probably consistent within a file. We may have different versions on different machines (e.g. Cray).

We can assume we have a reasonably large corpus of files - some hundreds, say.

We can assume two scenarios:

(1) We can rerun the program with different inputs so we can do experiments.
(2) We cannot rerun the program - we have a fixed set of documents. This has a gentle similarity to decoding historical documents in an unknown language (e.g. Linear B).

A partial solution may be acceptable - i.e. there may be some fields that no living person now understands, but most of the others are interpretable.

I am only interested in Open Source approaches.

UPDATE There is a related SO question (How to reverse engineer binary file formats for compatibility purposes) but the emphasis is somewhat different.
UPDATE Clever suggestion from @brianegge to address (1). Use truss (or possibly strace on Linux) to dump all write() and similar calls in the program. This should allow at least the collection of records written to disk.

窝囊感情。 2024-08-19 17:02:11

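All files have a header. Start there: look at what two files have in common, eliminate the shared "signature", and work on the differences. The parts that differ should record things such as the number of records, the export date and the like.

The part common to both headers can probably be treated as a generic signature, and I guess you can ignore it.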
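
A minimal sketch of the header comparison described above, assuming each file has already been read into a Python bytes object. The function name `classify_header_bytes`, the 64-byte window and the sample data are hypothetical and chosen only for illustration; a real header may be longer or of variable length.

```python
def classify_header_bytes(blobs, length=64):
    """Split the first `length` offsets into constant and varying positions."""
    # Only compare offsets that exist in every file.
    n = min(length, min(len(b) for b in blobs))
    constant, varying = [], []
    for off in range(n):
        values = {b[off] for b in blobs}
        (constant if len(values) == 1 else varying).append(off)
    return constant, varying

# Three made-up headers: a fixed tag, four binary bytes, then a year as text.
files = [
    b"INST" + bytes([0, 1, 0, 10]) + b"2001",
    b"INST" + bytes([0, 1, 0, 23]) + b"2003",
    b"INST" + bytes([0, 1, 1, 2]) + b"2004",
]

constant, varying = classify_header_bytes(files)
print("constant offsets:", constant)  # likely signature bytes
print("varying offsets: ", varying)   # candidates for counts, dates, sizes
```

With a corpus of a few hundred files the constant offsets tend to outline the signature, while the varying ones are the places worth decoding first.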

怀中猫帐中妖 2024-08-19 17:02:11

If you are on a system which offers truss, simply trace the program's write() system calls and you'll probably get a good idea of what it writes and in what units. It's also possible that the program mmaps the file and copies directly from memory, but that's less common.

$ truss -t write echo foo
foowrite(1, " f o o", 3)                                = 3
write(1, "\n", 1)                               = 1

It may also make sense to take a look at the program binary itself. On Unix systems, you can use objdump to view the layout of the binary. This will point to the code and data sections. You can then open the binary in a hex editor and go to the specific offsets. You may be interested in my tips for Solaris binary files.
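
Once a trace has been captured, the sizes of successive write() calls often hint at the record structure. The following is a rough sketch under stated assumptions: the trace is text with lines shaped like `write(fd, "...", count) = result`, and the regular expression, the function name `write_sizes` and the sample lines are all hypothetical. Real truss or strace output varies in spacing and truncation, so the pattern would probably need adjusting.

```python
import re
from collections import Counter

# Assumed line shape: write(3, "\0\0\1\0"..., 512) = 512
# This is a loose pattern for illustration, not a full trace parser.
WRITE_RE = re.compile(r'write\((\d+),.*,\s*(\d+)\)\s*=\s*(-?\d+)')

def write_sizes(trace_lines, fd):
    """Return the byte counts of successful write() calls on one descriptor."""
    sizes = []
    for line in trace_lines:
        m = WRITE_RE.search(line)
        if m and int(m.group(1)) == fd and int(m.group(3)) >= 0:
            sizes.append(int(m.group(3)))
    return sizes

# Made-up trace lines: descriptor 3 is the data file, descriptor 1 is stdout.
sample = [
    'write(3, "HDR1"..., 128) = 128',
    'write(3, "\\0\\0\\0\\20"..., 512) = 512',
    'write(1, "done\\n", 5) = 5',
    'write(3, "\\0\\0\\0\\21"..., 512) = 512',
]

sizes = write_sizes(sample, fd=3)
print(sizes)                         # order of writes to the data file
print(Counter(sizes).most_common())  # repeated sizes suggest fixed-length records
```

In this made-up trace, one 128-byte write followed by repeated 512-byte writes would suggest a header followed by fixed-length records, which is a hypothesis to check against the file itself.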

森末i 2024-08-19 17:02:11

Diff 2 or more files to look for similarities. This often helps you identify header blocks and different sections of the file.
Endianness is usually pretty easy to work out - more-significant bytes tend to be zero a lot more often than less-significant ones, so if you see a pattern like "00 78" or "78 00" you can make a good guess at which byte is the most significant. However, this is only of any help when you have worked out (roughly) what the preceding data is, so that you know how the data is aligned.
Look for easily identified data - strings are the first place to start because you can spot them easily. These often give you clues, as they are usually embedded near related data, used as standard items in headers, etc. If the strings are Unicode (UTF-16), you will usually see the letters of the text separated by zero bytes, which will help you identify the endianness and the data alignment at that point in the data.
A common format approach (like IFF) is to store chunks of data, each with a small header (e.g. a 2 or 4 byte ID, then a 2 or 4 byte size for the block, then the data of the block). In general people use meaningful (to them) chunk IDs, so they can be easy to spot. If you find what looks like a tag, check the following data to see if it looks like a length (look that many bytes further on in the data to see if there appears to be another header). If you can identify such a format, you break the "one large file" problem down into a "many small files" problem, which makes it much easier. (However, a lot of device data tends to be "optimised" to make it compact, in which case programmers often throw away convenient extensible formats and cram everything together, packing bits and generally making things much more difficult for you.)
Look for known values. If your device is displaying "temperature: 40" then it's possible that you will find that value directly stored in the file. (It's also common to use scaling factors or fixed-point values, though, so 40 may be represented as, e.g., 40*10 = 400 or 40*256 = 10240.)
If you can control the device enough: create some simple files. What you're trying to achieve is the smallest files you can get out of the device to minimise the data you have to examine. Then make a change on the device that causes the file to change - try to minimise the number of changes - and grab the file again. If the file format is "open" (not compressed or encrypted) then you should be able to identify the bytes that have changed.
If you can "load" files back onto the device, you may also be able to create your own files, changing just one value to see if you can notice any change of behaviour on the device. If you manage to hit simple values this can work well, but often you may find you just break the file format and the device won't be able to read the data at all.
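
The "look for known values" step can be sketched as a brute-force search, assuming the displayed value is an integer and the file fits in memory. The scale factors 1, 10 and 256 come from the example above; the function names, the choice of 16-bit and 32-bit unsigned integers plus IEEE float32, and the sample bytes are assumptions made for illustration. Non-IEEE formats such as VAX floating point are not covered.

```python
import struct

def candidate_encodings(value, scales=(1, 10, 256)):
    """Yield (label, bytes) pairs for plausible encodings of a displayed value."""
    for scale in scales:
        scaled = value * scale
        for order, name in (("<", "little"), (">", "big")):
            for code, width in (("H", 16), ("I", 32)):
                if 0 <= scaled < 2 ** width:
                    yield (f"uint{width} {name} x{scale}",
                           struct.pack(order + code, scaled))
    for order, name in (("<", "little"), (">", "big")):
        yield (f"float32 {name}", struct.pack(order + "f", float(value)))

def find_value(blob, value):
    """Return sorted (offset, label) pairs where an encoding of value occurs."""
    hits = []
    for label, pattern in candidate_encodings(value):
        start = blob.find(pattern)
        while start != -1:
            hits.append((start, label))
            start = blob.find(pattern, start + 1)
    return sorted(hits)

# Made-up file fragment: a tag, then 400 (40 scaled by 10) as big-endian uint16.
blob = b"HDR\x00" + struct.pack(">H", 400) + b"\xff\xff"
print(find_value(blob, 40))
```

Short patterns produce many accidental matches in a large file, and some encodings coincide (40 as a little-endian uint16 has the same bytes as 10240 as a big-endian one). A hit is more convincing when the same offset matches the displayed value across several files.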

﹂绝世的画 2024-08-19 17:02:11

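This is an interesting question. I think the answer is that reverse engineering a binary format takes skill, but there are tools that can help.

One such tool is WinOLS, which is designed for interpreting and editing binary images from vehicle engine management computers (mostly numeric data in lookup tables). It supports various endian formats (though not PDP, I think), lets you view data at various widths and offsets, define array regions (maps), and visualise them in 2D or 3D with various scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.

It is a commercial tool, but the free demo lets you do everything except save changes to the binary and use the engine management features, which you don't need. You said you are only interested in open source solutions, but this is Stack Overflow and others may not be so picky.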
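
For readers who do want an open source route, the "view data at various widths and byte orders" idea can be approximated with Python's standard struct module. This is only a sketch and not a description of how WinOLS works: the format table, the function name `interpret` and the sample bytes are assumptions, and PDP-endian or VAX floating-point layouts are not handled.

```python
import struct

# Fixed-size formats to try at each offset (struct format codes).
FORMATS = {
    "int16": "h", "uint16": "H",
    "int32": "i", "uint32": "I",
    "float32": "f", "float64": "d",
}

def interpret(blob, offset):
    """Decode the bytes at `offset` under each format and both byte orders."""
    rows = {}
    for name, code in FORMATS.items():
        for order, tag in (("<", "LE"), (">", "BE")):
            fmt = order + code
            # Skip formats that would read past the end of the data.
            if offset + struct.calcsize(fmt) <= len(blob):
                rows[f"{name} {tag}"] = struct.unpack_from(fmt, blob, offset)[0]
    return rows

# Made-up bytes: 40.0 stored as a big-endian float32, padded with zeros.
blob = struct.pack(">f", 40.0) + b"\x00" * 4
for label, value in interpret(blob, 0).items():
    print(f"{label:12} {value}")
```

Scanning the output for the one interpretation that yields a plausible value (here 40.0 under "float32 BE") is a quick way to test a hypothesis about a field before writing a full parser.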
