A compression algorithm optimized specifically for HTML content?

Posted 2024-08-24 10:28:49

Are there any compression algorithms -- lossy or lossless -- that have been specifically adapted to deal with real-world (messy and invalid) HTML content?

If not, what characteristics of HTML could we take advantage of to create such an algorithm? What are the potential performance gains?

Also, I'm not asking the question to serve such content (via Apache or any other server), though that's certainly interesting, but to store and analyze it.

Update: I don't mean GZIP -- that's obvious -- but rather an algorithm specifically designed to take advantage of characteristics of HTML content. For example, the predictable tag and tree structure.

Comments (11)

灼痛 2024-08-31 10:28:49

Brotli is a specialized HTML/English compression algorithm.

Source: https://en.wikipedia.org/wiki/Brotli

Unlike most general purpose compression algorithms, Brotli uses a
pre-defined 120 kilobyte dictionary. The dictionary contains over
13000 common words, phrases and other substrings derived from a large
corpus of text and HTML documents.[6][7] A pre-defined dictionary can
give a compression density boost for short data files.
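
Brotli's built-in dictionary is easy to try from Python. Below is a minimal sketch comparing it with gzip on a short HTML fragment; it assumes the third-party `brotli` package (published on PyPI as `Brotli`), and the exact byte counts will of course vary with the input.

```python
# Minimal comparison of gzip and Brotli on a short, tag-heavy HTML fragment.
# Requires the third-party "brotli" package (pip install Brotli).
import gzip
import brotli

html = b"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Example</title></head>
<body><div class="content"><p>Hello, world!</p></div></body></html>"""

gz = gzip.compress(html, compresslevel=9)
br = brotli.compress(html, quality=11)

print(f"original: {len(html)} bytes")
print(f"gzip -9 : {len(gz)} bytes")
print(f"brotli  : {len(br)} bytes")
```

On inputs this small, Brotli's predefined dictionary is what gives it the edge, which is exactly the property the question is after.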

辞别 2024-08-31 10:28:49

I do not know of an "off-the-shelf" compression library explicitly optimized for HTML content.

Yet, HTML text should compress quite nicely with generic algorithms (do read the bottom of this answer for better algorithms). Typically all variations on Lempel–Ziv perform well on HTML-like languages, owing to the highly repetitive nature of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, I think).

An idea to maybe improve upon these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns. In this fashion, we'd reduce the compressed size by using back-references to the very first instance of such a pattern. This gain would be particularly noticeable on smaller HTML documents.
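
One way to prototype this "primed buffer" idea without writing a codec from scratch is zlib's preset-dictionary feature, which seeds the DEFLATE window with bytes both sides agree on out of band. The sketch below is only illustrative: the dictionary contents are my own guess at common markup, not a measured one, and a tuned dictionary would be built from an actual corpus.

```python
# Sketch of priming DEFLATE's LZ77 window with common HTML substrings,
# using zlib's preset-dictionary support. The dictionary is never transmitted,
# so compressor and decompressor must agree on it in advance.
import zlib

HTML_DICT = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title>'
    b'<link rel="stylesheet" href=""><script src=""></script></head>'
    b'<body><div class=""><span></span></div><p></p><a href=""></a>'
    b'<ul><li></li></ul><table><tr><td></td></tr></table></body></html>'
)

def compress(data: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=HTML_DICT)
    return c.compress(data) + c.flush()

def decompress(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=HTML_DICT)
    return d.decompress(blob) + d.flush()

page = b'<!DOCTYPE html><html><head><title>Hi</title></head><body><p>Hi</p></body></html>'
plain = zlib.compress(page, 9)
primed = compress(page)
assert decompress(primed) == page
print(f"plain zlib: {len(plain)} bytes, primed zlib: {len(primed)} bytes")
```

As the answer suggests, the benefit is largest on small documents, where the back-references into the preset dictionary replace markup that would otherwise appear literally in the output.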

A complementary, similar idea is to have the compression and decompression methods imply (i.e. not transmit) the parameters of the secondary stage of an LZ-x algorithm (say, the Huffman tree in the case of LZH), using statistics specific to typical HTML and being careful to exclude from the character counts the [statistically weighted] instances of characters already encoded by back-references. Such a filtered character distribution would probably be closer to that of plain English (or the targeted web sites' national language) than the complete HTML text.


Unrelated to the above [educated, I hope] guesses, I started searching the web for information on this topic.

I found this 2008 scholarly paper (PDF format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.

I may otherwise be looking in the wrong places. There doesn't seem to be much interest in this. It could just be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Web-enabled cell phones (when bandwidth was at quite a premium...).

蓝戈者 2024-08-31 10:28:49

About the only "lossy" transformation I am willing to apply to HTML content, messy or not, is whitespace flattening. This is a typical post-publish step that high-volume sites perform on their content.

You can also flatten large JavaScript libraries using the YUI Compressor, which renames all JavaScript variables to short names, removes whitespace, etc. It is very important for large apps using kits like ExtJS, Dojo, etc.
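
For the storage-and-analysis case in the question, whitespace flattening is easy to prototype. The sketch below is deliberately naive: it does not protect `<pre>`, `<textarea>`, `<script>` or `<style>` content and ignores CSS white-space rules, so treat it as an illustration rather than a production minifier.

```python
# Naive whitespace-flattening sketch; real minifiers must preserve
# <pre>, <textarea>, <script> and <style> content verbatim.
import re

def flatten_whitespace(html: str) -> str:
    # Collapse runs of whitespace (including newlines) to a single space.
    html = re.sub(r"\s+", " ", html)
    # Drop whitespace squeezed between adjacent tags.
    html = re.sub(r">\s+<", "><", html)
    return html.strip()

messy = """
<div   class="box">
    <p>  Hello,
         world!  </p>
</div>
"""
print(flatten_whitespace(messy))
# -> <div class="box"><p> Hello, world! </p></div>
```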

给不了的爱 2024-08-31 10:28:49

Is gzip compression not sufficient for your needs? It gives you roughly a 10:1 compression ratio, not only on HTML content but also on JavaScript, CSS, etc., and is readily available on most servers and reverse proxies (e.g. Apache's mod_deflate, Nginx's NginxHttpGzipModule) and in all modern browsers (you can instruct both Apache and Nginx to skip compression for specific browsers based on User-Agent).

You'll be surprised how close gzip compression comes to optimal. Some people have suggested minifying your files; however, unless your files contain lots of comments (which a minifier can discard completely, i.e. what you probably referred to as "lossy" -- something you probably don't want to do with HTML anyway, unless you're sure none of your <script> or <style> tags sit inside HTML comments <!-- --> placed to accommodate antediluvian browsers), remember that minifying achieves most of its gains from a technique similar to (yet more limited than) DEFLATE. So expect a minified file to be larger or much larger than a gzipped original (particularly true with HTML, where you are stuck with the W3C's tags and attributes, and only gzip can help you there), and expect gzipping a minified file to give minimal gain over gzipping the original file (again, unless the original contained lots of comments that a minifier can safely discard).
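
If you want to sanity-check the "roughly 10:1" claim on your own stored pages, a small measurement loop is enough. The directory and file names below are hypothetical placeholders.

```python
# Measure the gzip compression ratio of stored HTML pages.
# "crawl/pages" is a hypothetical directory of saved .html files.
import gzip
from pathlib import Path

def gzip_ratio(path: Path, level: int = 9) -> float:
    raw = path.read_bytes()
    return len(raw) / len(gzip.compress(raw, compresslevel=level))

for page in Path("crawl/pages").glob("*.html"):
    print(f"{page.name}: {gzip_ratio(page):.1f}:1")
```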

猫烠⑼条掵仅有一顆心 2024-08-31 10:28:49

Use S-expressions instead, saves you a number of characters per tag :)

玩物 2024-08-31 10:28:49

If I understand your question correctly, what you need is gzip compression, which is available pretty easily with Apache.

我要还你自由 2024-08-31 10:28:49

Run your code through an HTML minifier/obfuscator that removes as much markup as possible, then let your web server compress it with gzip.

寻梦旅人 2024-08-31 10:28:49

No, there are not any HTML-specific compression algorithms, because the general-purpose ones have proved adequate.

The potential gains would come from knowing ahead of time the likely elements of an HTML page - you could start with a predefined dictionary that would not have to be part of the compressed stream. But this would not give a noticeable gain, as compression algorithms are extraordinarily good at picking out common sub-expressions on the fly.

洛阳烟雨空心柳 2024-08-31 10:28:49

You would usually use a common algorithm like gzip, which is supported by most browsers through the HTTP protocol. The Apache documentation shows how to enable mod_deflate without breaking your website's browser support.

Additionally, you can minify static HTML files (or do so dynamically).

小情绪 2024-08-31 10:28:49

You could consider each unique grouping (i.e. tags & attribs) as a symbol, determine a minimum symbol count and re-encode using Shannon's entropy; this would generate one large blob of bytes with maximal compression. I will say that this may not be much better than gzip.
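
As a back-of-the-envelope check on this idea, the sketch below tokenizes tags with a crude regex, treats each distinct tag-plus-attributes string as one symbol, and reports the Shannon entropy of that symbol stream. It is only an estimate of the achievable size of the tag stream, not an actual encoder, and the regex "parser" will mis-handle comments, CDATA and broken markup.

```python
# Estimate Shannon entropy of the tag-symbol stream of an HTML document.
import math
import re
from collections import Counter

def tag_symbol_entropy(html: str) -> tuple[float, int]:
    symbols = re.findall(r"<[^>]+>", html)   # crude tag tokenizer
    counts = Counter(symbols)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy, total

html = '<html><body><p class="a">x</p><p class="a">y</p><p class="b">z</p></body></html>'
bits_per_symbol, n = tag_symbol_entropy(html)
print(f"{n} tag symbols, ~{bits_per_symbol:.2f} bits/symbol, "
      f"~{bits_per_symbol * n / 8:.1f} bytes lower bound for the tag stream")
```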

Spring初心 2024-08-31 10:28:49

There is now Efficient XML Interchange (EXI) Format. From the abstract:

EXI is a very compact representation for the Extensible Markup Language (XML) Information Set that is intended to simultaneously optimize performance and the utilization of computational resources. The EXI format uses a hybrid approach drawn from the information and formal language theories, plus practical techniques verified by measurements, for entropy encoding XML information. Using a relatively simple algorithm, which is amenable to fast and compact implementation, and a small set of datatype representations, it reliably produces efficient encodings of XML event streams.

The Working Group's page links to other useful documents including a brief primer as well as empirical evaluation.

"Fast Infoset" is another compact binary XML representation.

These are for valid XML documents, so they may not meet your question's requirements for handling messy & invalid HTML.
