How can I edit files to change their MD5 hash without corrupting them?

Posted on 2024-12-27 15:19:51

I need to duplicate files of various types and change each copy slightly, so that its MD5 hash no longer matches the original's while the copy stays readable and uncorrupted.

TXT files - that's obvious. I just add a random string to the end of the file.
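
A minimal sketch of the text-file case in Java, assuming the files are small enough to read into memory. The class and method names (`TextCopy`, `md5Hex`, `copyWithRandomSuffix`) are made up for illustration; the random string here is a UUID on a new line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical helper, not part of any library.
class TextCopy {
    // Returns the MD5 digest of a file as lowercase hex.
    // Reads the whole file into memory, so it is only suitable for small files.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Copies source to target, then appends a line separator and a random UUID.
    static void copyWithRandomSuffix(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        String suffix = System.lineSeparator() + UUID.randomUUID();
        Files.write(target, suffix.getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
    }
}
```

Comparing `md5Hex(source)` with `md5Hex(target)` afterwards confirms that the pair no longer matches.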

PDF files - I started looking for a Java library to edit PDF files, but then I happened to open a PDF in Notepad++ and thought: why not try adding a random string to the end of the unreadable content I see there? To my surprise it worked, and the file wasn't corrupted.

ZIP files - I tried the same thing I did with PDF, and it also worked.

DOCX - the same method stopped working here. Appending just a space (" ") to the end of the binary content of a docx file opened in a text editor corrupts the file.

So what I need is:

  1. Java libraries for modifying Office documents: doc, docx, xls, xlsx, ppt, pptx.

  2. There are still file types whose MD5 hash I need to change but that I don't think are modifiable in Java - media files and executables, for example.
    So how can I do what I want with these files? Is there a way to just "touch" the file, change a header or something, and make it nonidentical to an untouched one?

Edit:
OK, here's the motivation - I want to generate a massive amount of data, as I asked here: How to produce massive amount of data?

At the time of that question, the answers I got there were enough, but now they aren't.

  1. I need the data to be nonidentical. Pairs of files must fail md5 hash test.

  2. I can't just generate random strings, because I need to simulate real files and documents.

  3. I can't use existing data dumps, because I need data sets of various sizes that include various file types. I need something that takes a size as input and generates the data for me.

So I figured that I should use a starting data set of all the file types that I eventually need, and just duplicate this data set.

Comments (1)

白云不回头 2025-01-03 15:19:51
  1. Java libraries for modifying Office documents: doc, docx, xls, xlsx, ppt, pptx.

Apache POI is a Java library for reading and modifying MS Office files. Note that the newer formats (xlsx, docx, etc.) are simply ZIP files containing XML, so unzipping them and modifying the plain-text XML might work as well.

The same advice applies to ZIP files: try unzipping the archive and modifying the simplest file inside.
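
As a rough illustration of working at the ZIP-container level, here is a sketch that uses only `java.util.zip`. It is a variation on the advice above, not the same thing: instead of editing any XML, it copies every entry into a new archive and stamps each one with a caller-chosen modification time, which changes the archive bytes (and so the MD5) while leaving the entry contents alone. The class and method names are hypothetical, and whether a particular Office version accepts the rewritten file is an assumption that needs checking.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Apache POI or any other library.
class ZipRetouch {
    // Copies all entries of a ZIP-based file (zip, docx, xlsx, pptx) to target,
    // setting every entry's modification time to epochMillis.
    // ZIP timestamps have 2-second resolution, so two copies need
    // timestamps at least that far apart to differ.
    static void rewriteWithTimestamp(Path source, Path target, long epochMillis) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(source));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(target))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                ZipEntry copy = new ZipEntry(entry.getName());
                copy.setTime(epochMillis);
                out.putNextEntry(copy);
                in.transferTo(out); // reads up to the end of the current entry (Java 9+)
                out.closeEntry();
            }
        }
    }
}
```

Calling it once per copy with a different timestamp gives a set of archives with identical contents and different hashes.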

But what are you actually trying to achieve? Note that appending a random string to the end of a file works only by chance. On another computer, or with another version of the software, the file might be considered corrupted...

I would advise you either to store some metadata external to the file rather than comparing MD5, or to look deeper into the file formats. There are almost always headers and various pieces of metadata hidden in a file (ID3 tags in MP3, EXIF in images, etc.), and it is much safer to modify those instead.
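
To make the metadata idea concrete, here is a sketch for one format that can be handled with the standard library alone. It uses PNG rather than the ID3 or EXIF cases mentioned above: it inserts a `tEXt` metadata chunk right after the mandatory `IHDR` chunk, with a valid CRC, so decoders treat it as an ordinary text annotation. The class and method names are hypothetical, and the input is assumed to be a well-formed PNG.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of any library.
class PngTextChunk {
    // 8-byte signature + IHDR chunk (4 length + 4 type + 13 data + 4 CRC).
    private static final int IHDR_END = 8 + 4 + 4 + 13 + 4;

    // Returns a copy of the PNG with a tEXt chunk (keyword, NUL, text) inserted after IHDR.
    // keyword should be 1-79 Latin-1 characters, as the PNG format requires.
    static byte[] addTextChunk(byte[] png, String keyword, String text) {
        if (png.length < IHDR_END || png[1] != 'P' || png[2] != 'N' || png[3] != 'G') {
            throw new IllegalArgumentException("not a PNG file");
        }
        byte[] type = "tEXt".getBytes(StandardCharsets.US_ASCII);
        byte[] data = (keyword + "\0" + text).getBytes(StandardCharsets.ISO_8859_1);
        CRC32 crc = new CRC32(); // the chunk CRC covers the type and data, not the length
        crc.update(type);
        crc.update(data);
        ByteBuffer chunk = ByteBuffer.allocate(12 + data.length); // big-endian by default
        chunk.putInt(data.length).put(type).put(data).putInt((int) crc.getValue());

        ByteArrayOutputStream out = new ByteArrayOutputStream(png.length + chunk.capacity());
        out.write(png, 0, IHDR_END);
        out.write(chunk.array(), 0, chunk.capacity());
        out.write(png, IHDR_END, png.length - IHDR_END);
        return out.toByteArray();
    }
}
```

Passing a different `text` value for each copy (a counter or a UUID, for example) yields files that decode to the same image but hash differently.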

Also look for reserved/unused bytes - they are quite common. But again - why are you doing this in the first place?
