Best compression algorithm for XML?

Posted 2024-07-26 07:35:01 · 624 words · 5 views · 0 comments

I barely know a thing about compression, so bear with me (this is probably a stupid and painfully obvious question).

So let's say I have an XML file with a few tags.

<verylongtagnumberone>
  <verylongtagnumbertwo>
    text
  </verylongtagnumbertwo>
</verylongtagnumberone>

Now let's say I have a bunch of these very long tags, with many attributes, across multiple XML files. I need to compress them to the smallest size possible. The best way would be to use an XML-specific algorithm that assigns individual tags pseudonyms like vlt1 or vlt2. However, that wouldn't be as 'open' as what I'm going for, and I want to use a common algorithm like DEFLATE or LZ. It would also help if the archive were a .zip file.

Since I'm dealing with plain text (no binary files like images), I'd like an algorithm that suits plain text. Which one produces the smallest file size (lossless algorithms are preferred)?

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.

EDIT: The 'encryption' thing was a typo; it should have been 'compression'.

Comments (8)

神回复 2024-08-02 07:35:01

There is a W3C standard (not yet released) named EXI (Efficient XML Interchange).

It should become THE data format for compressing XML data in the future (it is claimed to be the last necessary binary format). Being optimized for XML, it compresses XML far more efficiently than any conventional compression algorithm.

With EXI, you can operate on compressed XML data on the fly (without the need to uncompress or re-compress it).

EXI = (XML + XMLSchema) as binary.

And here is an open-source implementation (I don't know whether it's stable yet):
Exificient

命硬 2024-08-02 07:35:01

Yes, *.zip is best in practice. The gory details are in this USENIX paper, which shows that "optimal" compressors are not worth the computational cost and that domain-specific compressors don't beat zip [on average].

Disclaimer: I wrote that paper, which has been cited 60+ times according to Google.

呢古 2024-08-02 07:35:01

Another alternative to "compress" XML would be FI (Fast Infoset).

XML stored as FI contains every tag and attribute name only once;
all other occurrences reference the first one,
which saves space.

See:

Very good article on java.sun.com, and of course
the Wikipedia entry

Compared with EXI, from the compression point of view, Fast Infoset
(being structured plaintext) is less efficient.

Another important difference: FI is a mature standard with many implementations.
One of them: Fast Infoset Project @ dev.java.net
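
To make the "every name only once" idea concrete, here is a minimal Python sketch of name indexing. It is not the real Fast Infoset encoding (that is a binary format with its own rules); the function `index_names` and the event tuples are invented for illustration only.

```python
import xml.etree.ElementTree as ET

def index_names(xml_text: str):
    """Return (name_table, events); events refer to names by table index.

    Illustration only: tail text and namespaces are ignored.
    """
    names: list[str] = []
    index: dict[str, int] = {}
    events: list[tuple] = []

    def ref(name: str) -> int:
        # Store a name the first time it is seen, then reuse its index.
        if name not in index:
            index[name] = len(names)
            names.append(name)
        return index[name]

    def walk(element: ET.Element) -> None:
        tag_id = ref(element.tag)
        attrs = [(ref(key), value) for key, value in element.attrib.items()]
        events.append(("start", tag_id, attrs))
        if element.text and element.text.strip():
            events.append(("text", element.text.strip()))
        for child in element:
            walk(child)
        events.append(("end",))

    walk(ET.fromstring(xml_text))
    return names, events

sample = (
    "<verylongtagnumberone>"
    "<verylongtagnumbertwo>text</verylongtagnumbertwo>"
    "<verylongtagnumbertwo>more</verylongtagnumbertwo>"
    "</verylongtagnumberone>"
)
names, events = index_names(sample)
print(names)
print(events)
```

Each long tag name appears once in `names`; the second `verylongtagnumbertwo` element costs only a small integer in `events`.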

天邊彩虹 2024-08-02 07:35:01

It seems like you're more interested in compression than in encryption. Is that the case? If so, this might prove an interesting read, even though it is not an exact solution.

夏末染殇 2024-08-02 07:35:01

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.

Then I'd suggest you use .zip compression, or your users will get confused.
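
As a rough sketch of what such a package could look like, the Python standard library's `zipfile` module can write XML members with DEFLATE. The member names (`content.xml`, `meta.xml`) and their contents are made up for the example, not part of any real standard.

```python
import io
import zipfile

# Hypothetical package layout; the member names are made up for illustration.
members = {
    "content.xml": "<verylongtagnumberone><verylongtagnumbertwo>text"
                   "</verylongtagnumbertwo></verylongtagnumberone>" * 500,
    "meta.xml": "<meta><title>Example</title></meta>",
}

def build_package(files: dict[str, str]) -> bytes:
    """Write each XML string into an in-memory .zip using DEFLATE."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text.encode("utf-8"))
    return buffer.getvalue()

package = build_package(members)

# Any standard zip tool can read the result back.
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    names = archive.namelist()
    restored = archive.read("content.xml").decode("utf-8")

print(names, len(package), "bytes")
```

Because the container is an ordinary .zip, users can open it with whatever archive tool they already have.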

没有伤那来痛 2024-08-02 07:35:01

Your alternatives are:

  • Use a webserver that supports gzip compression. It'll auto compress all outgoing HTML. There's a small CPU penalty though.
  • Use something like JSON. It'll drastically reduce the size of the message.
  • There's also a binary XML but I have not tried it myself.

故事与诗 2024-08-02 07:35:01

I hope I understood correctly what you need to do.

First, there are no good or bad compression algorithms for text: zip, bzip, gzip, rar and 7zip are all good enough to compress anything with low entropy, i.e. a large file with a small character set. If I had to pick, I would choose 7zip first, rar second and zip third. But the difference is very small, so you should try whatever is easiest for you.

Second, I could not understand what you are trying to encrypt. If it is an XML file, you should first compress it with your favourite compression algorithm and then encrypt it with your favourite encryption algorithm. In most cases any modern algorithm, for instance one implemented in PGP, will be secure enough for anything.

Hope that helps.
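
One way to "try whatever is easiest" is to measure. The sketch below, assuming a synthetic and very repetitive XML sample built in memory, compares the three general-purpose codecs in the Python standard library (rar is not available there). Real documents will give different numbers, so treat it as a measuring harness rather than a verdict.

```python
import bz2
import lzma
import zlib

# Hypothetical sample: a repetitive XML document built in memory.
record = (
    "<verylongtagnumberone>\n"
    "  <verylongtagnumbertwo>text</verylongtagnumbertwo>\n"
    "</verylongtagnumberone>\n"
)
xml_bytes = ("<root>\n" + record * 2000 + "</root>\n").encode("utf-8")

def compressed_sizes(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each stdlib codec."""
    return {
        "deflate (zlib)": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "lzma (7zip/xz family)": len(lzma.compress(data)),
    }

sizes = compressed_sizes(xml_bytes)
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name:24s} {size:8d} bytes  ({size / len(xml_bytes):.2%} of original)")
```

Swapping `xml_bytes` for the bytes of a real file from your own format is the more meaningful test.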

我喜欢麦丽素 2024-08-02 07:35:01

None of the default ones is ideal for XML, but you will still get good results since there is a lot of repetition.

Because XML contains a lot of repeated symbols (tags, `.`, `>`), you want those to cost less than one bit each, which calls for some form of arithmetic coding rather than Huffman coding. So rar / 7zip should be significantly better in theory. These algorithms offer high compression, so they are slower. Ideally you'd want a simple compressor with an arithmetic coder (which for XML would be fast and give high compression).
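
The "less than a bit" point can be checked with a small calculation. A Huffman code spends a whole number of bits per symbol (at least one), while the Shannon entropy, which arithmetic coding can approach, may be below one bit per symbol when one symbol dominates. The probabilities below are invented for illustration and do not come from any real XML corpus.

```python
import heapq
import math

def entropy_bits(probs: dict[str, float]) -> float:
    """Shannon entropy in bits per symbol (the bound arithmetic coding approaches)."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def huffman_average_bits(probs: dict[str, float]) -> float:
    """Average code length, in bits per symbol, of a Huffman code for probs."""
    # Heap entries: (probability, tie_breaker, {symbol: code_length}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, lengths1 = heapq.heappop(heap)
        p2, _, lengths2 = heapq.heappop(heap)
        # Merging two subtrees makes every symbol in them one bit longer.
        merged = {s: n + 1 for s, n in {**lengths1, **lengths2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    lengths = heap[0][2]
    return sum(probs[s] * lengths[s] for s in probs)

# Hypothetical skewed distribution: one symbol dominates.
probs = {"a": 0.9, "b": 0.05, "c": 0.05}
entropy = entropy_bits(probs)
huffman = huffman_average_bits(probs)
print(f"entropy: {entropy:.3f} bits/symbol, Huffman: {huffman:.3f} bits/symbol")
```

With this distribution the Huffman code cannot go below one bit for the dominant symbol, while the entropy is well under one bit per symbol.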
