Is there a way to determine in advance whether a file is a good candidate for compression?
I'm planning a .NET project that involves automated upload of files of the most diverse types, from various distributed clients to a constellation of servers, and sometimes the file extension may not match the real file type (long story).

Using HTTP compression will not always be an option, and in this project's case it is preferable to spend more client processing than bandwidth or server storage. But it would be even better if we could skip the compression step entirely whenever we can determine in advance that compressing a file won't give worthwhile results.

I know there is no "right answer", but we would appreciate any ideas.
6 Answers
Filtering by file type is a good idea. Even if some files have the wrong extensions, overall it should be a good bet.

Text files, for example, compress extremely well, while compressing mp3, jpg/gif, or divx files is of little use, since those formats are already compressed.
Given what you say about extensions, I can see a couple of ways.

First: can you determine the type of the file without using the extension? Lots of file types have standard headers, so you could parse the header and determine whether it is one of the dozen or so common file types you have implemented filters for.

Second: a simpler heuristic would be to grab, say, 100 bytes from the middle of the file and see if they are standard ASCII, e.g. each byte has a value between 9 and 126. This will be wrong a certain percentage of the time, will not work on text in a lot of languages, and will not work on Unicode text.
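Both checks can be sketched roughly like this (Python here for brevity, since the question's .NET side ports directly; the magic-number table is a small illustrative sample, not an exhaustive list):

```python
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "jpeg",   # already compressed
    b"\x89PNG":      "png",    # already compressed
    b"PK\x03\x04":   "zip",    # already compressed
    b"\x1f\x8b":     "gzip",   # already compressed
    b"%PDF":         "pdf",
}

def sniff_type(data: bytes):
    """First idea: parse the header and match it against known signatures."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None  # unknown type; fall back to a heuristic

def looks_like_text(data: bytes, sample_size: int = 100) -> bool:
    """Second idea: grab ~100 bytes from the middle and check that every
    byte falls in the 9..126 range (rough ASCII test; as noted above, this
    fails on non-Latin and Unicode text)."""
    mid = len(data) // 2
    sample = data[mid : mid + sample_size]
    return bool(sample) and all(9 <= b <= 126 for b in sample)
```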
By "previously", do you mean before you actually compress or send? You might keep some data and base your decision on that: map file types, extensions, and sizes to compression time and final size, and see if you can learn what works.
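A hypothetical sketch of that bookkeeping, assuming we key on extension and track a running average compression ratio (the class name and the 0.95 cutoff are invented for illustration):

```python
from collections import defaultdict

class CompressionStats:
    """Record observed compression results and skip compression for
    categories that historically haven't paid off."""

    def __init__(self):
        # extension -> [sum of observed ratios, number of observations]
        self._totals = defaultdict(lambda: [0.0, 0])

    def record(self, ext: str, original_size: int, compressed_size: int) -> None:
        entry = self._totals[ext]
        entry[0] += compressed_size / original_size
        entry[1] += 1

    def worth_compressing(self, ext: str, cutoff: float = 0.95) -> bool:
        ratio_sum, count = self._totals[ext]
        # No history yet: try compressing and learn from the result.
        return count == 0 or ratio_sum / count < cutoff
```

The same idea extends to keying on sniffed file type or size bucket instead of extension.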
You could try compressing the file with a very fast compressor. If that compressor can't compress it enough, then it is useless to try to recompress it better. Yes, this sounds like a crude idea, but technically a .zip file could contain a txt file using the "stored" format (so no compression), and that .zip would be highly compressible, so there is no magic bullet.

(Technically you could measure the entropy of the file, but then, as suggested in How to calculate the entropy of a file?, gzip it to test it :-))
You could get a pointer by doing a byte-frequency analysis, perhaps also with an MTF (move-to-front) step to transform local repetition into something more measurable. The cost is cheap: a single linear scan of the file.
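A sketch of both steps, assuming plain Shannon entropy over byte frequencies as the measure (the move-to-front transform is the classic textbook version):

```python
import math
from collections import Counter

def move_to_front(data: bytes) -> bytes:
    """MTF step: recently seen bytes get small output values, so local
    repetition turns into runs of near-zero bytes."""
    alphabet = list(range(256))
    out = bytearray()
    for b in data:
        i = alphabet.index(b)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return bytes(out)

def bits_per_byte(data: bytes) -> float:
    """Shannon entropy from byte frequencies; one linear scan.
    Close to 8.0 means the data already looks random (poorly
    compressible); well below 8.0 suggests compression may pay off."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```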
You can try compressing the first several KB of each file internally before sending it, and see how many bytes it compresses down to. If the result looks good enough, compress the whole thing before sending.

One thing to be careful about with this approach is that in many file formats the first "few" KB may be header-like data that is not representative of the rest of the file. So you might want to increase the sample size, take the sample from another part of the file, or take multiple sub-samples from different parts of the file to form your sample.