Compression algorithms optimized specifically for HTML content?
Are there any compression algorithms -- lossy or lossless -- that have been specifically adapted to deal with real-world (messy and invalid) HTML content?
If not, what characteristics of HTML could we take advantage of to create such an algorithm? What are the potential performance gains?
Also, I'm not asking the question to serve such content (via Apache or any other server), though that's certainly interesting, but to store and analyze it.
Update: I don't mean GZIP -- that's obvious -- but rather an algorithm specifically designed to take advantage of characteristics of HTML content. For example, the predictable tag and tree structure.
Brotli is a compression algorithm specialized for HTML/English content: its built-in static dictionary was built from a large corpus of web text and markup.

Source: https://en.wikipedia.org/wiki/Brotli
I do not know of an "off-the-shelf" compression library explicitly optimized for HTML content.
Yet, HTML text should compress quite nicely with generic algorithms (do read the bottom of this answer for better algorithms). Typically all variations on Lempel–Ziv perform well on HTML-like languages, owing to the highly repetitive nature of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, I think).
An idea to maybe improve upon these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns at large. In this fashion, we'd reduce the compressed size by using references back to such a pattern rather than spelling out its first instance. This gain would be particularly significant on smaller HTML documents.
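This "priming" idea is exactly what DEFLATE's preset-dictionary feature provides. A minimal sketch with Python's stdlib `zlib`, where `HTML_DICT` is a made-up sample of boilerplate (a real dictionary would be mined from the corpus you intend to store):

```python
import zlib

# Hypothetical preset dictionary of common HTML boilerplate. The LZ window
# is seeded with these bytes, so matching runs in the input can be encoded
# as back-references instead of literals.
HTML_DICT = (b'<!DOCTYPE html><html lang="en"><head><meta charset="utf-8">'
             b'<title></title></head><body><div class=""><p></p></div>'
             b'</body></html>')

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress, optionally priming the LZ window with a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(data) + c.flush()

def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the receiver must supply the exact same dictionary."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()
```

On a small page the dictionary can shave a noticeable fraction off the output, since the boilerplate compresses to back-references; the catch is that compressor and decompressor must agree byte-for-byte on the dictionary.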
A complementary, similar idea is to have the compression and decompression methods imply (i.e. not send) the auxiliary information of an LZ-x algorithm (say, the Huffman tree in the case of LZH), using statistics specific to typical HTML, taking care to exclude from the character counts the [statistically weighted] instances of characters encoded by back-reference. Such a filtered character distribution would probably be closer to that of plain English (or the targeted web sites' national language) than that of the complete HTML text.
Unrelated to the above [educated, I hope] guesses, I started searching the web for information on this topic.
I found this 2008 scholarly paper (pdf format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.
I may be otherwise looking in the wrong places. There doesn't seem to be much interest in this. It could just be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Web-enabled cell phones (when bandwidth was at quite a premium...).
About the only "lossy" transformation I am willing to apply to HTML content, messy or not, is whitespace flattening. This is a typical post-publish step that high-volume sites perform on their content.

You can also flatten large JavaScript libs using the YUI Compressor, which renames all JavaScript vars to short names, removes whitespace, etc. It is very important for large apps using kits like ExtJS, Dojo, etc.
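The whitespace-flattening step is a couple of regexes at its core. A naive sketch (a production minifier must also leave `<pre>`, `<textarea>`, and inline scripts untouched, which this deliberately ignores):

```python
import re

def flatten(html: str) -> str:
    # Drop whitespace between adjacent tags entirely...
    html = re.sub(r">\s+<", "><", html)
    # ...and collapse any remaining runs of whitespace to a single space.
    return re.sub(r"[ \t\r\n]+", " ", html).strip()
```

This is "lossy" only in the sense that the original indentation can't be recovered; the rendered page is unchanged for most markup.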
Is gzip compression not adequate for your needs? It gives you roughly a 10:1 compression ratio, not only on HTML content but also on JavaScript, CSS, and similar files, and it is readily available on most servers and reverse proxies (e.g. Apache's mod_deflate, Nginx's NginxHttpGzipModule, etc.) and in all modern browsers (you can instruct Apache and Nginx to skip compression for specific browsers based on User-Agent). You will be surprised how close gzip gets to optimal compression. Some people have suggested minifying your files; however, unless your files contain lots of comments (which the minifier can discard completely, i.e. what you might call "lossy"), you probably don't want to do that to HTML anyway, unless you are sure your…
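A quick back-of-envelope check of that ratio claim, using Python's stdlib `gzip` on a synthetic, highly repetitive table (real pages vary, and this markup is friendlier to gzip than most):

```python
import gzip

# Build ~20 KB of repetitive markup: 500 near-identical table rows.
rows = "".join(f'<tr><td class="cell">row {i}</td></tr>' for i in range(500))
page = f"<html><body><table>{rows}</table></body></html>".encode()

packed = gzip.compress(page, compresslevel=9)
ratio = len(page) / len(packed)  # well above 10:1 on input this repetitive
```

The more a page repeats its own tags and attribute patterns, the closer gzip gets to the point where an HTML-specific scheme has little left to claw back.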