Java - Calculating File Compression
Is there a way to get the possible compression ratio of a file just by reading it?
You know, some files are more compressible than others... my software has to tell me the percentage of possible compression for my files.
E.g.:
Compression Ratio: 50% -> I can save 50% of my file's space if I compress it
Compression Ratio: 99% -> I can save only 1% of my file's space if I compress it
3 Answers
First, this will depend largely on the compression method you choose. And second, I seriously doubt it's possible without time and space costs comparable to actually doing the compression. I'd say your best bet is to compress the file, keeping track of the size of what you've already produced and dropping/freeing it (once you're done with it, obviously) instead of writing it out.
To actually do this, unless you really want to implement it yourself, it'll probably be easiest to use the java.util.zip classes, in particular the Deflater class and its deflate method.
Firstly, you need to look into information theory. Two results from that field are relevant here: Kolmogorov complexity (the length of the shortest possible encoding of the data, which is not computable) and Shannon entropy (a statistical bound on the average number of bits per symbol).
So, you can't find the compressed size without evaluating the actual compression. But if you need an approximation, you can rely on Shannon's entropy and build a simple statistical model. Here is a very simple solution:
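A sketch of such a model (names are illustrative): an order-0 estimate that counts byte frequencies and computes the Shannon entropy in bits per byte. Because it ignores repeated sequences and context, it tends to overestimate the compressed size of highly redundant files.

```java
import java.io.FileInputStream;
import java.io.IOException;

public class EntropyEstimator {
    /** Estimated compressed/original size ratio from byte-frequency (order-0) Shannon entropy. */
    public static double estimateRatio(String path) throws IOException {
        long[] counts = new long[256];
        long total = 0;
        byte[] buf = new byte[8192];
        try (FileInputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    counts[buf[i] & 0xFF]++; // histogram of byte values
                }
                total += n;
            }
        }
        double bitsPerByte = 0.0; // Shannon entropy, between 0 and 8
        for (long c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            bitsPerByte -= p * (Math.log(p) / Math.log(2));
        }
        return total == 0 ? 1.0 : bitsPerByte / 8.0; // e.g. 0.5 means ~50% of original size
    }
}
```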
Your estimate will be more or less the same as ZIP's default compression algorithm (deflate). Here is a more advanced version of the same idea (be aware that it uses a lot of memory!): it actually uses entropy to determine block boundaries, applying segmentation to divide the file into runs of homogeneous data.
Not possible without examining the file. The only thing you can do is get an approximate ratio by file extension, based on statistics gathered from a relatively large sample by doing the actual compression and measuring. For example, a statistical analysis will likely show that .zip and .jpg files are not very compressible, but files like .txt and .doc may be highly compressible.
The results of this are for rough guidance only and will probably be way off in some cases, as there's absolutely no guarantee of compressibility by file extension. The file could contain anything, no matter what the extension says.
UPDATE: Assuming you can examine the file, you can use the java.util.zip APIs to read the raw file, compress it, and see what the before/after difference is.
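One way to do that comparison, as a sketch (it assumes the file fits in memory, and note that GZIPOutputStream adds a small header on top of raw deflate):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class BeforeAfter {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(args[0])); // whole file in memory
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(raw); // compress into the in-memory buffer
        }
        long before = raw.length;
        long after = compressed.size();
        System.out.printf("before=%d bytes, after=%d bytes, saved=%.1f%%%n",
                before, after, before == 0 ? 0.0 : 100.0 * (before - after) / before);
    }
}
```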