XML compression

Can XML be compressed with </> closing elements?

Posted on 2024-07-12 21:40:28

Is there any reason why XML such as this:

<person>    
    <firstname>Joe</firstname>    
    <lastname>Plumber</lastname>
</person>

couldn't be compressed like this for client/server transfer?

<person>    
    <firstname>Joe</>    
    <lastname>Plumber</>
</>

It would be smaller - and slightly faster to parse.

Assuming that there are no edge conditions meaning this wouldn't work - are there any libraries to do such a thing?

This turns out to be a hard thing to google:

Your search - </> - did not match any
documents.

Suggestions:

Try different keywords.

Edit: There seems to be some confusion about what I'm asking. I am talking about my own form of compression. I am fully aware that, as it stands, this is NOT XML. The server and client would have to be 'in on the scheme'. It would be especially helpful for schemas that have very long element names, because the bandwidth taken up by those element names would be halved.

清晰传感 2024-07-19 21:40:28

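That's not valid XML. Closing tags must be named. Otherwise it is potentially error-prone and, frankly, I think your version would be less readable.

Regarding your clarification about deliberately violating the XML standard to save a few bytes: it's a very bad idea, for several reasons:

It's non-standard, and you may have to support it far into the future;
Standards exist for a reason. Standards and conventions have a lot of power, and "custom XML" is on a par with ivory-tower graphic designers forcing programmers to write a custom button replacement because the standard button can't do whatever weird, wonderful and confusing behaviour they have dreamed up;
Gzip compression is easy and effective, and it doesn't break the standard. If you see a gzip octet stream, there is no mistaking it for XML. The real problem with your shorthand scheme is that the document still looks like XML at the top, so some poor unsuspecting parser might mistake it for valid XML and then bomb out with a different, misleading error;
Information theory: compression works by removing redundant information. If you remove that redundancy by hand, gzip gains nothing from it, because the same amount of information still has to be represented;
There is significant overhead in converting documents to and from this scheme. It can't be done with a standard XML parser, so you effectively have to write your own XML parser and outputter that understand this scheme (actually, converting to this format can be done with a parser; getting it back is harder), which is a lot of work (and a lot of bugs).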

紫竹語嫣☆ 2024-07-19 21:40:28

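If you wrote a compression routine that did this, then yes, you could compress the stream and restore it at the other end.

The reasons this isn't done are:

Better XML-agnostic compression schemes already exist (in terms of compression ratio, and probably in terms of CPU and space too: some 7 N UTF-8 document would get 14% compression, but would need at least 2 N bytes of space to decompress, rather than the constant space most decompression algorithms require).
Better XML-aware compression schemes already exist (google 'binary xml'); ASN.1-based schemes compress much better than merely halving the size of the markup that indicates an element's type.
The decompressor must parse the non-standard XML and keep a stack of the open tags it has encountered, so unless you plug it in instead of the parser, you have doubled the parsing cost. If you do plug it in instead of the parser, you are mixing different layers, which is liable to cause confusion at some point.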

伏妖词 2024-07-19 21:40:28

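As you say, this isn't XML, so why make it look like XML at all? You've already lost the ability to use any XML parsers or tools. I would either

use XML and compress it on the wire, since you'll see greater savings than with your own scheme, or
use another, more compact format such as YAML or JSON.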

猫烠⑼条掵仅有一顆心 2024-07-19 21:40:28

If you need better compression and easier parsing, you may try using XML attributes:

<person firstname="Joe" lastname="Plumber" />

沫雨熙 2024-07-19 21:40:28

What you are describing is SGML, which uses </> to close the most recent preceding non-empty tag.

嗼ふ静 2024-07-19 21:40:28

Is there any reason why

Taking your question philosophically, SGML did allow </> close tags. There was debate about allowing this into the XML standard. The reasoning for rejecting it was that omitting the names from end tags would sometimes result in less readable XML. So, that is a "reason why".

It's hard to beat existing text compression rates, but one advantage of your "compression" scheme is the XML remains human readable on the wire. Another advantage is that if you have to enter XML by hand (e.g. for testing), it's a (minor) convenience to not have to close end tags. That is, it's more human writable than standard XML. I say "minor", because most editors will do string completion for you (e.g. ^n and ^p in vim).

To strip the close tags: simplest is to use a substitution like this: s#</[a-zA-Z0-9_$]+>#</>#g (that's not the right QName regex, but you get the idea).

To add them back: you need a special parser, because SAX and other XML parsers won't recognize this (as it's not "XML"). But the (simplest) parsing just needs to recognize open tag names and close tag names.

have a stack.
scan the XML, and output it, as-is.
if you recognize an open tag, push its name.
if you recognize close tag, pop to get its name, and
  insert that in the output (you can do this even when there is a proper close tag).
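
The stack-based procedure above can be sketched in Python roughly as follows. This is a minimal sketch, not a real parser: the names `strip_close_names`, `restore_close_names` and the simplified `TAG` pattern are assumptions made for illustration, and comments, CDATA sections and `>` inside attribute values are not handled. It also closes whatever is still on the stack at end of input, which covers the variant where trailing close tags are omitted.

```python
import re

# Simplified tag pattern (an assumption, not a full XML grammar):
# group 1 = "/" for a close tag, group 2 = optional tag name,
# group 3 = attributes or other content, group 4 = "/" for a self-closing tag.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)?([^<>]*?)(/?)>")


def strip_close_names(xml):
    """Replace every named close tag such as </person> with </>."""
    return re.sub(r"</[A-Za-z_][\w.:-]*\s*>", "</>", xml)


def restore_close_names(text):
    """Re-insert close-tag names using a stack of open tag names."""
    stack, out, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])      # copy text between tags as-is
        pos = m.end()
        closing, name, _attrs, self_closing = m.groups()
        if closing:
            # Close tag (named or not): the name comes from the stack.
            # An unbalanced close tag raises IndexError here.
            out.append(f"</{stack.pop()}>")
        else:
            out.append(m.group(0))
            # Declarations and comments have no name; self-closing tags
            # never need a matching close tag.
            if name and not self_closing:
                stack.append(name)
    out.append(text[pos:])
    while stack:                              # EOF: close whatever is left
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

For simple documents like the example in the question, `restore_close_names(strip_close_names(doc))` should give back `doc` unchanged.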

BTW (in response to a comment above), this works because in XML a close tag can only ever correspond to the most recent open tag. Same as nested parentheses.

However, I think you're right, that someone has surely done this already. Maybe check Python or Perl repositories?

EDIT: You can further omit trailing </>, so your example becomes (when the parser sees EOF, it adds close tags for whatever's left on the stack):

<person>    
    <firstname>Joe</>    
    <lastname>Plumber

转身以后 2024-07-19 21:40:28

If data size is any kind of concern, then XML is not for you.

风柔一江水 2024-07-19 21:40:28

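Even if this were possible, it would only take longer to parse, because the parser would now have to work out what is being closed and keep checking that this is correct.

If you need compression, XML is highly compressible.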

寻找一个思念的角度 2024-07-19 21:40:28

You may be interested to read about the different tag formats in SGML. For example, the following could be valid SGML:

<p/This paragraph contains a <em/bold/ word./

Fortunately, the designers of XML chose to omit this particular chapter of madness.

凉宸 2024-07-19 21:40:28

If not using gzip or anything like that, I'd simply replace each tag name with a shorter one before sending, and map the names back before using the XML on the receiving end. Thus you'd get something like this:

<a>
    <b>Joe</b>
    <c>Plumber</c>
</a>

This makes it very easy to use any standard parser to iterate through all nodes and replace the nodeNames accordingly.
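
As an illustration of the tag-renaming idea, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The mapping tables `SHORT` and `LONG` and the helper `rename_tags` are hypothetical names; both ends would have to agree on the mapping in advance, and the sketch ignores namespaces and attribute names.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping shared by client and server (illustrative names only).
SHORT = {"person": "a", "firstname": "b", "lastname": "c"}
LONG = {short: long for long, short in SHORT.items()}


def rename_tags(xml_text, mapping):
    """Parse with a standard parser and rename every element via mapping."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():                 # iter() includes the root itself
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


original = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
shortened = rename_tags(original, SHORT)     # sender side
restored = rename_tags(shortened, LONG)      # receiver side
```

Because the shortened document is still well-formed XML, every standard tool keeps working on it, unlike with the `</>` scheme.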

姜生凉生 2024-07-19 21:40:28

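Don't bother with in-text optimizations of the XML that degrade read/write performance and simplicity. Use deflate compression to compact the payload between client and server. I ran some tests: compressing a normal 10k XML file gives a 2.5k blob. Removing all the end-tag names brings the original file down to 9k, but deflated it is again 2.5k. This is a good example of why dictionary-based compression is the easy way to compress the payload between endpoints. A "</>" and a fully named close tag will take (nearly) the same space in the compressed data.

The only exception is when the file or data is very small and therefore less compressible.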
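
A comparison of this kind can be reproduced with a short script. The following is a rough sketch using Python's standard `zlib` module (which wraps deflate in a small zlib header); the helper name `sizes` and the synthetic sample document are assumptions, so the numbers it prints will differ from the figures quoted above and depend on the input.

```python
import re
import zlib


def sizes(xml_text):
    """Return raw and deflated byte counts with and without close-tag names."""
    stripped = re.sub(r"</[A-Za-z_][\w.:-]*>", "</>", xml_text)
    full_bytes = xml_text.encode("utf-8")
    stripped_bytes = stripped.encode("utf-8")
    return {
        "full_raw": len(full_bytes),
        "stripped_raw": len(stripped_bytes),
        "full_deflated": len(zlib.compress(full_bytes, 9)),
        "stripped_deflated": len(zlib.compress(stripped_bytes, 9)),
    }


# Synthetic sample data (an assumption; substitute a real payload to measure).
records = "".join(
    f"<person><firstname>Joe{i}</firstname><lastname>Plumber{i}</lastname></person>"
    for i in range(200)
)
result = sizes(f"<people>{records}</people>")
for label, size in result.items():
    print(f"{label}: {size} bytes")
```

Running it on a representative payload shows how much of the saving from dropping close-tag names survives once deflate is applied.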