Java - Calculating File Compression

Posted on 2024-12-25 08:59:05


Is there a way to get the possible compression ratio of a file just by reading it?
You know, some files are more compressible than others... my software has to tell me the percentage of possible compression of my files.

e.g.
Compression Ratio: 50% -> I can save 50% of my file's space if I compress it
Compression Ratio: 99% -> I can save only 1% of my file's space if I compress it


Comments (3)

明月松间行 2025-01-01 08:59:05


First, this will depend largely on the compression method you choose. Second, I seriously doubt it's possible without a computation whose time and space complexity are comparable to actually doing the compression. I'd say your best bet is to compress the file, keeping track of the size of what you've already produced and dropping/freeing it (once you're done with it, obviously) instead of writing it out.

To actually do this, unless you really want to implement it yourself, it'll probably be easiest to use the java.util.zip package, in particular the Deflater class and its deflate method.
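
A minimal sketch of that idea, assuming deflate is representative of the compression you would actually use; the class and method names below are illustrative, not from any existing API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.Deflater;

    // Sketch: run Deflater over the file, count the bytes it produces,
    // and discard the output instead of writing it anywhere.
    public class CompressionEstimator {

        // Returns compressed size / original size, e.g. 0.5 means the
        // compressed file would be roughly half the original.
        public static double compressionRatio(Path file) throws IOException {
            Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
            byte[] input = new byte[8192];
            byte[] output = new byte[8192];
            long originalSize = 0;
            long compressedSize = 0;
            try (InputStream in = Files.newInputStream(file)) {
                int read;
                while ((read = in.read(input)) != -1) {
                    originalSize += read;
                    deflater.setInput(input, 0, read);
                    // Drain the deflater; the output buffer is thrown away.
                    while (!deflater.needsInput()) {
                        compressedSize += deflater.deflate(output);
                    }
                }
                deflater.finish();
                while (!deflater.finished()) {
                    compressedSize += deflater.deflate(output);
                }
            } finally {
                deflater.end();
            }
            return originalSize == 0 ? 1.0 : (double) compressedSize / originalSize;
        }
    }

Multiplying the returned value by 100 gives the percentage figure the question asks for.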

秋风の叶未落 2025-01-01 08:59:05


Firstly, you need a bit of information theory. There are two relevant theories in the field:

  1. According to Shannon, one can compute the entropy (i.e. the compressed size) of a source from its symbol probabilities. So, the smallest compressed size is defined by a statistical model that produces symbol probabilities at each step. All compression algorithms use that approach, implicitly or explicitly. See the Wikipedia article on entropy (information theory) for more details.
  2. According to Kolmogorov, the smallest compressed size can be found by finding the smallest possible program that produces the source. In that sense, it is not computable. Some programs partially use this approach to compress data (e.g. you could write a small console application that generates 1 million digits of pi instead of zipping those 1 million digits).

So, you can't find the compressed size without evaluating the actual compression. But if you need an approximation, you can rely on Shannon's entropy theory and build a simple statistical model. Here is a very simple solution:

  1. Compute order-1 statistics for each symbol in the source file.
  2. Calculate entropy by using those statistics.

Your estimation will be more or less the same as ZIP's default compression algorithm (deflate). A more advanced version of the same idea (be aware that it uses lots of memory!) actually uses entropy to determine block boundaries, segmenting the file into homogeneous chunks of data.
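
A minimal sketch of that estimate, assuming single-byte symbols; the class and method names are illustrative only:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Sketch: count byte frequencies, compute Shannon entropy in bits per
    // byte, and turn that into an estimated compressed fraction.
    public class EntropyEstimator {

        public static double estimatedCompressedFraction(Path file) throws IOException {
            long[] counts = new long[256];
            long total = 0;
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    for (int i = 0; i < read; i++) {
                        counts[buffer[i] & 0xFF]++;
                    }
                    total += read;
                }
            }
            if (total == 0) {
                return 1.0;
            }
            // Shannon entropy: H = -sum(p * log2 p), measured in bits per byte.
            double entropy = 0.0;
            for (long count : counts) {
                if (count > 0) {
                    double p = (double) count / total;
                    entropy -= p * (Math.log(p) / Math.log(2));
                }
            }
            // 8 bits per uncompressed byte, so entropy / 8 approximates the
            // fraction of the original size an ideal order-0 coder would need.
            return entropy / 8.0;
        }
    }

Note that a frequency-only estimate ignores repeated sequences, so on highly repetitive data it will tend to overestimate the size that deflate actually achieves.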

锦爱 2025-01-01 08:59:05


Not possible without examining the file. The only thing you can do is get an approximate ratio by file extension, based on statistics gathered from a relatively large sample by doing the actual compression and measuring. For example, a statistical analysis will likely show that .zip and .jpg files are not very compressible, but files like .txt and .doc might be highly compressible.

The results of this are for rough guidance only and will probably be way off in some cases, since there is absolutely no guarantee of compressibility by file extension. The file could contain anything, regardless of what the extension suggests it may or may not be.

UPDATE: Assuming you can examine the file, you can use the java.util.zip APIs to read the raw file, compress it, and see what the before/after difference is.
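
For example, a quick before/after check with GZIP from java.util.zip might look like the sketch below; it holds the compressed bytes in memory, so it is only reasonable for smallish files, and the class name is illustrative:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.GZIPOutputStream;

    public class BeforeAfterCheck {

        public static void main(String[] args) throws IOException {
            byte[] original = Files.readAllBytes(Paths.get(args[0]));

            // Compress into memory just to measure the resulting size.
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
                gzip.write(original);
            }

            long before = original.length;
            long after = compressed.size();
            System.out.printf("before: %d bytes, after: %d bytes, saved: %.1f%%%n",
                    before, after, before == 0 ? 0.0 : 100.0 * (before - after) / before);
        }
    }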
