Recovering corrupted zip or gzip files?

Posted 2024-07-05 05:56:51


The most common method for corrupting compressed files is to inadvertently do an ASCII-mode FTP transfer, which causes a many-to-one trashing of CR and/or LF characters.

Obviously, there is information loss, and the best way to fix this problem is to transfer again, in FTP binary mode.
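
To make the failure mode concrete, here is a minimal Python sketch (not part of the original question) that simulates an ASCII-mode transfer by collapsing CRLF pairs inside a gzip stream. After the collapse there is no way to tell which LF bytes originally had a CR in front of them, which is exactly the many-to-one loss described above.

```python
import gzip
import os
import zlib

# Make a gzip archive of random data: the compressed bytes are effectively
# random, so CRLF (0x0D 0x0A) pairs almost surely occur somewhere in 1 MiB.
original = gzip.compress(os.urandom(1 << 20))

# Simulate an ASCII-mode FTP transfer (Windows -> Unix direction):
# every CRLF pair in the stream is collapsed to a bare LF.
corrupted = original.replace(b"\r\n", b"\n")
print(len(original) - len(corrupted), "CR bytes silently dropped")

try:
    zlib.decompress(corrupted, wbits=31)  # wbits=31: gzip wrapper, CRC checked
except zlib.error as e:
    print("decompression failed:", e)
```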

However, if the original is lost, and it's important, how recoverable is the data?

[Actually, I already know what I think is the best answer (it's very difficult but sometimes possible - I'll post more later), and the common non-answers (lots of off-the-shelf programs for repairing CRCs without repairing data), but I thought it would be interesting to try out this question during the stackoverflow beta period, and see if anyone else has gone down the successful-recovery path or discovered tools I don't know about.]


Comments (2)

演多会厌 2024-07-12 05:56:52


You could try writing a little script to replace all of the CRs with CRLFs (assuming the direction of trashing was CRLF to CR), swapping them randomly per block until you get the correct CRC. Assuming the data isn't particularly large, I guess it might even finish before the heat death of the universe.

As there is definite information loss, I don't know that there is a better way. Loss in the CR to CRLF direction might be slightly easier to roll back.
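
For illustration, a minimal Python sketch of that brute-force idea (my own, not the answerer's code), assuming the trashing direction was CRLF to CR so that every surviving CR either stood alone or lost a trailing LF; the gzip trailer CRC serves as the success test:

```python
import itertools
import zlib

def candidate_streams(corrupted: bytes):
    """Yield every way of expanding some subset of the CR bytes back
    into CRLF pairs: 2**n candidates for n CRs, hence the heat-death
    caveat above."""
    positions = [i for i, b in enumerate(corrupted) if b == 0x0D]
    for mask in itertools.product((False, True), repeat=len(positions)):
        out, prev = bytearray(), 0
        for pos, expand in zip(positions, mask):
            out += corrupted[prev:pos + 1]  # copy up to and including the CR
            if expand:
                out += b"\n"                # restore the LF that was trashed
            prev = pos + 1
        out += corrupted[prev:]
        yield bytes(out)

def brute_force_repair(corrupted: bytes):
    """Return the first candidate whose gzip CRC and length check out."""
    for candidate in candidate_streams(corrupted):
        try:
            # wbits=31 selects the gzip wrapper; the CRC32/ISIZE trailer
            # is verified, so success here means the checksum matched.
            zlib.decompress(candidate, wbits=31)
            return candidate
        except zlib.error:
            continue
    return None
```

With n trashed line endings this enumerates 2**n candidates, so it is only practical for a handful of CRs; note also that a matching CRC32 is strong evidence, not proof, that the reconstruction is the original.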

甜警司 2024-07-12 05:56:51


From Bukys Software:

Approximately 1 in 256 bytes is known to be corrupted, and the corruption is known to occur only in bytes with the value '\012'. So the byte error rate is 1/256 (0.39% of input), and 2/256 bytes (0.78% of input) are suspect. But since only three bits per smashed byte are affected, the bit error rate is only 3/(256*8): 0.15% is bad, 0.29% is suspect.
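
The "three bits" figure follows from the byte values involved: LF ('\012', 0x0A) and CR ('\015', 0x0D) differ in exactly three bit positions (0x0A XOR 0x0D = 0x07), so each smashed byte flips at most three bits. The quoted percentages check out:

```python
# LF ('\012' = 0x0A) and CR ('\015' = 0x0D) differ in exactly 3 bits,
# which is where the "three bits per smashed byte" figure comes from.
assert bin(0x0A ^ 0x0D).count("1") == 3

print(f"byte error rate: {1 / 256:.2%}")        # 0.39% known bad
print(f"suspect bytes:   {2 / 256:.2%}")        # 0.78% suspect
print(f"bit error rate:  {3 / (256 * 8):.2%}")  # 0.15% known bad
print(f"suspect bits:    {6 / (256 * 8):.2%}")  # 0.29% suspect
```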

...

An error in the compressed input disrupts the decompression process for all subsequent bytes... The fact that the decompressed output is recognizably bad so quickly is cause for hope -- a search for the correct answer can identify wrong answers quickly.

Ultimately, several techniques were combined to successfully extract reasonable data from these files:

  • Domain-specific parsing of fields and quoted strings
  • Machine learning from previous data with low probability of damage
  • Tolerance for file damage due to other causes (e.g. disk full while logging)
  • Lookahead for guiding the search along the highest-probability paths

These techniques identify 75% of the necessary repairs with certainty, and the remainder are explored highest-probability-first, so that plausible reconstructions are identified immediately.
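
As a rough illustration of that strategy (my own sketch, not Bukys' code), the following explores repairs highest-plausibility-first over the suspect '\012' positions. Here plausibility() is a hypothetical stand-in for the domain-specific parsing and learned statistics described above, and each suspect byte is assumed to be either a genuine LF or a smashed CR:

```python
import heapq
import zlib

def plausibility(text: bytes) -> float:
    """Hypothetical scoring stand-in: fraction of printable ASCII in the
    decompressed prefix. The real thing would use domain-specific field
    parsing and statistics learned from undamaged data."""
    if not text:
        return 0.0
    ok = sum(32 <= b < 127 or b in (9, 10, 13) for b in text)
    return ok / len(text)

def apply_choices(data: bytes, positions: list[int], choices: tuple) -> bytes:
    """Rewrite each decided suspect byte with the chosen replacement;
    undecided positions are left as-is."""
    out, prev = bytearray(), 0
    for pos, repl in zip(positions, choices):
        out += data[prev:pos] + repl
        prev = pos + 1
    out += data[prev:]
    return bytes(out)

def decompressed_prefix(data: bytes, chunk: int = 4096) -> bytes:
    """Decompress as far as the (possibly still-corrupt) stream allows."""
    d = zlib.decompressobj(wbits=31)
    out = bytearray()
    for i in range(0, len(data), chunk):
        try:
            out += d.decompress(data[i:i + chunk])
        except zlib.error:
            break  # a wrong repair usually dies here, early
    return bytes(out)

def best_first_repair(corrupted: bytes, suspects: list[int]):
    """Explore repairs highest-plausibility-first: each suspect byte
    either stays LF or is assumed to have replaced a CR."""
    heap = [(0.0, ())]  # (negated score, choices so far)
    while heap:
        _, choices = heapq.heappop(heap)
        if len(choices) == len(suspects):
            candidate = apply_choices(corrupted, suspects, choices)
            try:
                return zlib.decompress(candidate, wbits=31)  # CRC verified
            except zlib.error:
                continue  # plausible-looking but wrong; keep searching
        else:
            for repl in (b"\n", b"\r"):
                new = choices + (repl,)
                prefix = decompressed_prefix(
                    apply_choices(corrupted, suspects, new))
                heapq.heappush(heap, (-plausibility(prefix), new))
    return None
```

The worst case is still exponential, but because a wrong choice tends to make the decompressed prefix recognizably bad almost immediately, plausible reconstructions surface early in the queue, which is the effect the write-up describes.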
