Can XML be compressed using </> end elements?

Posted 2024-07-12 21:40:28

Is there any reason why XML such as this:

<person>    
    <firstname>Joe</firstname>    
    <lastname>Plumber</lastname>
</person>

couldn't be compressed like this for client/server transfer?

<person>    
    <firstname>Joe</>    
    <lastname>Plumber</>
</>

It would be smaller - and slightly faster to parse.

Assuming that there are no edge conditions meaning this wouldn't work - are there any libraries to do such a thing?

This turns out to be a hard thing to google:

Your search - </> - did not match any documents.

Suggestions:

Try different keywords.

Edit: There seems to be some confusion about what I'm asking. I am talking about my own form of compression. I am fully aware that, as it stands, this is NOT XML. The server and client would both have to be 'in on the scheme'. It would be especially helpful for schemas that have very long element names, because the bandwidth taken up by those element names would be halved.

Comments (14)

清晰传感 2024-07-19 21:40:28

That's not valid XML. Closing tags must be named. Omitting the names is potentially error-prone, and frankly I think it would be less readable your way.

Regarding your clarification that this is a deliberate, nonstandard departure from the XML standard to save a few bytes: it is an incredibly bad idea, for several reasons:

  1. It's nonstandard, and it may have to be supported far into the future;
  2. Standards exist for a reason. Standards and conventions have a lot of power, and having "custom XML" ranks up there with Ivory Tower graphic designers who force programmers to write a custom button replacement because the standard one can't do whatever weird, wonderful and confusing behaviour was dreamt up;
  3. Gzip compression is easy, far more effective, and won't break standards. If you see a gzip octet stream, there's no mistaking it for XML. The real problem with the shorthand scheme you've got is that it still has an XML declaration at the top, so some poor unsuspecting parser may make the mistake of thinking it's valid and bomb out with a different, misleading error;
  4. Information theory: compression works by removing redundancy of information. If you remove it by hand, gzip will not become any more effective afterwards, because the same amount of information is still represented;
  5. There is a significant overhead in converting documents to and from this scheme. It can't be done with a standard XML parser, so you'd effectively have to write your own XML parser and outputter that understands this scheme (actually, conversion to this format can be done with a parser; getting it back is more difficult), which is a lot of work (and a lot of bugs).
紫竹語嫣☆ 2024-07-19 21:40:28

If you wrote a compression routine which did that, then yes, you could compress a stream and restore it at the other end.

The reasons this isn't done are:

  • Much better XML-agnostic compression schemes already exist (in terms of compression ratio, and probably in terms of CPU and space: a certain 7N-byte UTF-8 document would get 14% compression but require at least 2N bytes of space to decompress, rather than the constant space required by most decompression algorithms).
  • Much better XML-aware compression schemes already exist (google 'binary xml'). For schema-aware compression, the schemes based on ASN.1 do much better than merely halving the size devoted to indicating the element type.
  • The decompressor must parse the non-standard XML and keep a stack of the open tags it has encountered. So unless you're plugging it in instead of a parser, you have doubled the parsing cost. If you do plug it in instead of the parser, you're mixing different layers, which is liable to cause confusion at some point.
伏妖词 2024-07-19 21:40:28

As you say, this isn't XML, so why make it even look like XML? You've already lost the ability to use any XML parsers or tools. I would either

  • Use XML, and compress it on the wire as you'll see far greater savings than with your own scheme
  • Use another more compact format like YAML or JSON
猫烠⑼条掵仅有一顆心 2024-07-19 21:40:28

If you need better compression and easier parsing, you may try using XML attributes:

<person firstname="Joe" lastname="Plumber" />
沫雨熙 2024-07-19 21:40:28

What you are describing is SGML, which uses </> to end the nearest previous non-empty tag.

嗼ふ静 2024-07-19 21:40:28

Is there any reason why

Taking your question philosophically, SGML did allow </> close tags. There was debate about allowing this into the XML standard. The reasoning for rejecting it was that omitting the names from end tags would sometimes result in less readable XML. So, that is a "reason why".

It's hard to beat existing text compression rates, but one advantage of your "compression" scheme is the XML remains human readable on the wire. Another advantage is that if you have to enter XML by hand (e.g. for testing), it's a (minor) convenience to not have to close end tags. That is, it's more human writable than standard XML. I say "minor", because most editors will do string completion for you (e.g. ^n and ^p in vim).

To strip the close tags, the simplest way is a substitution such as s#</[a-zA-Z0-9_$]+>#</>#g (that's not the right QName regex, but you get the idea).
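
For illustration, the same substitution could be sketched in Python roughly as follows. The function name `strip_close_names` and the simplified name pattern are assumptions made for this example; like the regex above, the pattern is not the full XML QName grammar, and it would also rewrite anything that merely looks like an end tag inside comments or CDATA sections.

```python
import re

# Simplified element-name pattern (an assumption, not the real QName grammar).
CLOSE_TAG = re.compile(r"</[A-Za-z_][A-Za-z0-9_.:-]*\s*>")


def strip_close_names(xml_text: str) -> str:
    """Replace every named end tag with the anonymous form </>."""
    return CLOSE_TAG.sub("</>", xml_text)


if __name__ == "__main__":
    print(strip_close_names("<person><firstname>Joe</firstname></person>"))
```

The result is no longer XML, so only a receiver that knows about the scheme can read it.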

To add them back: you need a special parser, because SAX and other XML parsers won't recognize this (as it's not "XML"). But the (simplest) parsing just needs to recognize open tag names and close tag names.

have a stack.
scan the XML, and output it, as-is.
if you recognize an open tag, push its name.
if you recognize close tag, pop to get its name, and
  insert that in the output (you can do this even when there is a proper close tag).

BTW (in response to a comment above), this works because in XML a close tag can only ever correspond to the most recent open tag. Same as nested parentheses.

However, I think you're right, that someone has surely done this already. Maybe check Python or Perl repositories?

EDIT: You can further omit trailing </>, so your example becomes (when the parser sees EOF, it adds close tags for whatever's left on the stack):

<person>    
    <firstname>Joe</>    
    <lastname>Plumber
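
The stack-based restoration described in this answer, including closing whatever is still open at end of input, might look something like the sketch below. `restore_close_names` and the `TAG` pattern are names invented for this example. It assumes tags and attribute values never contain a literal `<` or `>`, and that comments and CDATA sections do not contain tag-like text; a real implementation would need a proper tokenizer.

```python
import re

# Matches start tags, self-closing tags, and named or anonymous end tags.
# Group 1: "/" for an end tag; group 2: the element name, if any;
# group 3: everything else before ">" (attributes, trailing "/", etc.).
TAG = re.compile(r"<(/)?([A-Za-z_][\w.:-]*)?([^<>]*)>")


def restore_close_names(text: str) -> str:
    """Turn </> back into named end tags using a stack of open elements."""
    stack = []

    def fix(match):
        closing, name, rest = match.group(1), match.group(2), match.group(3)
        if closing:
            if not stack:
                raise ValueError("end tag without a matching start tag")
            # An end tag always closes the most recently opened element,
            # so the popped name is used even if a name was supplied.
            return "</" + stack.pop() + ">"
        if name and not rest.endswith("/"):
            stack.append(name)  # ordinary start tag: remember its name
        return match.group(0)   # start tags, <?...?>, <!...> pass through

    restored = TAG.sub(fix, text)
    # At end of input, close whatever is still open, innermost first.
    return restored + "".join("</" + n + ">" for n in reversed(stack))


if __name__ == "__main__":
    print(restore_close_names("<person><firstname>Joe</><lastname>Plumber"))
```

Because the stack is the only state, the memory needed grows with the nesting depth of the document rather than with its length.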
转身以后 2024-07-19 21:40:28

If size of the data is any issue at all, XML is not for you.

风柔一江水 2024-07-19 21:40:28

Even if this were possible it could only take longer to parse because now the parser has to work out what's being closed and will have to keep checking if that's correct.

If you want compression, XML is highly gzip'able.

寻找一个思念的角度 2024-07-19 21:40:28

You may be interested to read about the different tag formats in SGML. For example, the following could be valid SGML:

<p/This paragraph contains a <em/bold/ word./

Fortunately, the designers of XML chose to omit this particular chapter of madness.

凉宸 2024-07-19 21:40:28

If not using gzip or anything like that, I'd simply replace each tag name with a shorter one before sending, and restore the original names before using the XML on the receiving end. Thus you'd get something like this:

<a>
    <b>Joe</b>
    <c>Plumber</c>
</a>

This makes it very easy to use any standard parser to iterate through all the nodes and replace the node names accordingly.
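
As a rough sketch of that idea, Python's standard `xml.etree.ElementTree` can do the renaming in both directions. The mapping tables `SHORT_NAMES` and `LONG_NAMES` and the function `rename_tags` are hypothetical; in practice the client and server would have to agree on the mapping in advance.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping shared by client and server.
SHORT_NAMES = {"person": "a", "firstname": "b", "lastname": "c"}
LONG_NAMES = {short: long for long, short in SHORT_NAMES.items()}


def rename_tags(xml_text: str, mapping: dict) -> str:
    """Parse the document, rename every element found in mapping, re-serialize."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # includes the root element itself
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
    short = rename_tags(original, SHORT_NAMES)  # sent over the wire
    print(short)
    print(rename_tags(short, LONG_NAMES))       # restored by the receiver
```

Unlike the </> scheme, both the short and the long form remain well-formed XML, so ordinary parsers and tools keep working.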

姜生凉生 2024-07-19 21:40:28

Do not bother with in-text optimizations of your XML that degrade reading/writing performance and simplicity. Use deflate compression to compress your payload between the client and the server. I ran some tests: compressing a normal 10k XML file results in a 2.5k blob. Removing all the end tag names lowers the original file size to 9k, but once deflated it's again 2.5k. This is a very good example showing that dictionary-based compression is the simple way to compress payloads between endpoints. An anonymous end tag and a named end tag will use (almost) the same space in the compressed data.

The only exception would be if the data is very small and therefore less compressible.
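
A comparison along these lines can be reproduced with Python's standard `zlib` module. The sketch below is not the test described above: `sample_document` generates a made-up, highly repetitive document, so the exact numbers will differ from the 10k/2.5k figures, and both function names are assumptions for this example.

```python
import re
import zlib

# Simplified end-tag pattern (not the full XML name grammar).
CLOSE_TAG = re.compile(r"</[A-Za-z_][\w.:-]*>")


def sample_document(records: int) -> str:
    """Build a hypothetical test document with repeated person records."""
    body = "".join(
        f"<person><firstname>Joe{i}</firstname><lastname>Plumber{i}</lastname></person>"
        for i in range(records)
    )
    return "<people>" + body + "</people>"


def compressed_sizes(xml_text: str) -> dict:
    """Return byte sizes of the document with and without end tag names,
    both raw and after zlib (deflate) compression."""
    raw = xml_text.encode("utf-8")
    stripped = CLOSE_TAG.sub("</>", xml_text).encode("utf-8")
    return {
        "raw": len(raw),
        "raw_deflated": len(zlib.compress(raw)),
        "stripped": len(stripped),
        "stripped_deflated": len(zlib.compress(stripped)),
    }


if __name__ == "__main__":
    for key, size in compressed_sizes(sample_document(200)).items():
        print(f"{key}: {size} bytes")
```

Running it on your own payloads is the only reliable way to see how much, if anything, stripping the end tag names saves once deflate is applied.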

蓝颜夕 2024-07-19 21:40:28

Yes, XML is a rather heavy format. But it has certain advantages.

If you think XML is too heavy for your use, have a look at JSON instead. It is lightweight but has less functionality than XML.

And if you want really small files, use a binary format ;-).

Smile简单爱 2024-07-19 21:40:28

Sorry, that's not in the spec. If you have a big XML file, you had better compress it with zip, gzip, or similar.

黄昏下泛黄的笔记 2024-07-19 21:40:28

Is there any reason you aren't using YAML or JSON?
