Is there any reason why XML such as this:
<person>
<firstname>Joe</firstname>
<lastname>Plumber</lastname>
</person>
couldn't be compressed like this for client/server transfer?
<person>
<firstname>Joe</>
<lastname>Plumber</>
</>
It would be smaller - and slightly faster to parse.
Assuming there are no edge cases that would break this, are there any libraries that do such a thing?
This turns out to be a hard thing to google:
Your search - </> - did not match any documents.
Suggestions: Try different keywords.
Edit: There seems to be some confusion about what I'm asking. I am talking about my own form of compression. I am fully aware that, as it stands, this is NOT XML. The server and client would both have to be 'in on the scheme'. It would be especially helpful for schemas that have very long element names, because the bandwidth taken up by those element names would be halved.
That's not valid XML. Closing tags must be named. It's potentially error prone otherwise and frankly I think it'd be less readable your way.
In reference to your clarification about this being a nonstandard violation of the XML standard to save a few bytes, it is an incredibly bad idea for several reasons:
If you wrote a compression routine which did that, then yes, you could compress a stream and restore it at the other end.
The reasons this isn't done are:
As you say, this isn't XML, so why make it even look like XML? You've already lost the ability to use any XML parsers or tools. I would either
If you need better compression and easier parsing, you may try using XML attributes:
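As a rough illustration of the attribute form, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The helper name `children_to_attributes` is made up for this example, and it assumes every child is a uniquely named, text-only element (attributes cannot repeat or nest).

```python
import xml.etree.ElementTree as ET


def children_to_attributes(xml: str) -> str:
    """Fold text-only child elements into attributes of the root element.

    Illustrative helper: assumes child names are unique and hold plain text.
    """
    root = ET.fromstring(xml)
    for child in list(root):
        # Only fold leaves that have no children and no attributes of their own.
        if len(child) == 0 and not child.attrib:
            root.set(child.tag, (child.text or "").strip())
            root.remove(child)
    root.text = None  # drop leftover whitespace between the removed children
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    source = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
    compact = children_to_attributes(source)
    print(compact)
    print(len(source), "->", len(compact), "characters")
```

The result is still ordinary XML, so any standard parser can read it on the other end.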
What you are describing is SGML, which uses </> to end the nearest previous nonempty tag.
Taking your question philosophically, SGML did allow </> close tags. There was debate about allowing this into the XML standard. The reasoning for rejecting it was that omitting the names from end tags would sometimes result in less readable XML. So, that is a "reason why".
It's hard to beat existing text compression rates, but one advantage of your "compression" scheme is that the XML remains human readable on the wire. Another advantage is that if you have to enter XML by hand (e.g. for testing), it's a (minor) convenience not to have to name the end tags. That is, it's more human writable than standard XML. I say "minor" because most editors will do string completion for you (e.g. ^n and ^p in vim).
To strip the close tags, the simplest approach is something like this:
s_</[a-zA-Z0-9_$]+>_</>_
(That's not the right QName regex, but you get the idea.)
To add them back, you need a special parser, because SAX and other XML parsers won't recognize this (as it's not "XML"). But the (simplest) parsing just needs to recognize open tag names and close tags.
BTW (in response to a comment above), this works because in XML a close tag can only ever correspond to the most recent open tag that has not yet been closed. Same as nested parentheses.
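A minimal Python sketch of both directions, assuming the same simplified tag-name pattern as the regex above rather than a real QName. The function names are invented for illustration, and comments, CDATA sections, processing instructions, and `>` inside attribute values are deliberately not handled. The restore step is just a stack of open element names. It also closes whatever is still open at end of input.

```python
import re

# Simplified name pattern (an assumption, not the full XML QName grammar).
NAME = r"[A-Za-z_][\w.:\-]*"
CLOSE_TAG = re.compile(rf"</{NAME}>")
# Either an open/self-closing tag (name captured) or the anonymous close tag.
TOKEN = re.compile(rf"<({NAME})[^<>]*?(/?)>|</>")


def strip_close_names(xml: str) -> str:
    """Replace every named close tag with the anonymous form '</>'."""
    return CLOSE_TAG.sub("</>", xml)


def restore_close_names(short: str) -> str:
    """Rebuild named close tags by tracking open elements on a stack."""
    out, stack, pos = [], [], 0
    for m in TOKEN.finditer(short):
        out.append(short[pos:m.start()])  # text between tags, copied as-is
        pos = m.end()
        if m.group(0) == "</>":
            # Anonymous close: it can only match the innermost open element.
            out.append(f"</{stack.pop()}>")
        else:
            out.append(m.group(0))
            if not m.group(2):  # not self-closing, so it stays open
                stack.append(m.group(1))
    out.append(short[pos:])
    while stack:  # end of input: close anything still open
        out.append(f"</{stack.pop()}>")
    return "".join(out)


if __name__ == "__main__":
    full = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
    short = strip_close_names(full)
    print(short)
    print(restore_close_names(short))
```

Malformed input with more `</>` than open tags raises `IndexError` from the empty stack, which is about as much validation as this sketch attempts.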
However, I think you're right, that someone has surely done this already. Maybe check Python or Perl repositories?
EDIT: You can further omit trailing </> tags; when the parser sees EOF, it adds close tags for whatever's left on the stack.
If size of the data is any issue at all, XML is not for you.
Even if this were possible it could only take longer to parse because now the parser has to work out what's being closed and will have to keep checking if that's correct.
If you want compression, XML is highly gzip'able.
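For what it's worth, a small Python sketch of the gzip route using only the standard library. The `pack` and `unpack` names are hypothetical. The point is that the payload stays standard XML on both ends, so a stock parser works after decompression.

```python
import gzip
import xml.etree.ElementTree as ET


def pack(xml: str) -> bytes:
    """Compress an XML document for transfer."""
    return gzip.compress(xml.encode("utf-8"))


def unpack(payload: bytes) -> ET.Element:
    """Decompress a payload and parse it with a standard XML parser."""
    return ET.fromstring(gzip.decompress(payload).decode("utf-8"))


if __name__ == "__main__":
    # Made-up sample data: 100 repetitive records.
    rows = "".join(
        f"<person><firstname>Joe{i}</firstname><lastname>Plumber{i}</lastname></person>"
        for i in range(100)
    )
    doc = f"<people>{rows}</people>"
    payload = pack(doc)
    print(len(doc.encode("utf-8")), "bytes raw,", len(payload), "bytes gzipped")
    print(len(unpack(payload)), "records after the round trip")
```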
You may be interested to read about the different tag formats in SGML. For example, the following could be valid SGML:
Fortunately, the designers of XML chose to omit this particular chapter of madness.
If you're not using gzip or anything like that, I'd simply replace each tag name with a shorter one before sending, and swap the original names back in on the receiving end before using the XML. Thus you'd get something like this:
This makes it very easy to use any standard parser to iterate through all nodes and replace the node names accordingly.
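A short Python sketch of that renaming idea with `xml.etree.ElementTree`. The `SHORT` table and the one-letter names in it are purely hypothetical. Client and server would have to agree on whatever mapping is actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping shared by client and server.
SHORT = {"person": "p", "firstname": "f", "lastname": "l"}
LONG = {short: full for full, short in SHORT.items()}


def rename_tags(xml: str, table: dict) -> str:
    """Rename every element whose tag appears in the table."""
    root = ET.fromstring(xml)
    for element in root.iter():
        element.tag = table.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    full = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
    wire = rename_tags(full, SHORT)   # before sending
    print(wire)
    print(rename_tags(wire, LONG))    # on the receiving end
```

Unlike the `</>` scheme, the shortened document is still well-formed XML, which is why a stock parser is enough here.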
Do not bother with in-text optimizations of your XML that degrade reading/writing performance and simplicity. Use deflate compression to compress your payload between the client and the server. I ran some tests: compressing a normal 10k XML file results in a 2.5k blob. Removing all end tag names lowers the original file size to 9k, but once deflated it's again 2.5k. This is a very good example of why dictionary-based compression is the simple way to compress payloads between endpoints. A named end tag and "</>" will use (almost) the same space in the compressed data.
The only exception would be very small files or payloads, which are less compressible.
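If you want to repeat that kind of measurement yourself, here is a rough Python sketch using `zlib` (deflate with a small zlib wrapper). The sample document is synthetic and invented for this example, so the numbers it prints will not match the 10k/2.5k figures above. It only shows how to compare the four sizes.

```python
import re
import zlib


def sample(n: int = 200) -> str:
    """Build a synthetic, repetitive XML document (made-up test data)."""
    rows = "".join(
        f"<person><firstname>Joe{i}</firstname><lastname>Plumber{i}</lastname></person>"
        for i in range(n)
    )
    return f"<people>{rows}</people>"


def sizes(xml: str) -> dict:
    """Byte counts for the full and name-stripped forms, raw and deflated."""
    # Simplified tag-name pattern, not a full QName.
    stripped = re.sub(r"</[A-Za-z_][\w.:\-]*>", "</>", xml)
    raw_full, raw_stripped = xml.encode("utf-8"), stripped.encode("utf-8")
    return {
        "raw_full": len(raw_full),
        "raw_stripped": len(raw_stripped),
        "deflated_full": len(zlib.compress(raw_full, 9)),
        "deflated_stripped": len(zlib.compress(raw_stripped, 9)),
    }


if __name__ == "__main__":
    for label, size in sizes(sample()).items():
        print(f"{label:>18}: {size} bytes")
```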
Yes, XML is a kind of heavy format. But it has certain advantages.
If you think XML is too heavy for your use, have a look at JSON instead. It is lightweight but has fewer features than XML.
And if you want really small files, use a binary format ;-).
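To get a feel for the JSON option, here is a tiny Python sketch that converts the flat record from the question. The helper `flat_xml_to_json` is invented for this example and assumes one level of uniquely named, text-only children. Attributes, mixed content, and repeated elements would need a real mapping.

```python
import json
import xml.etree.ElementTree as ET


def flat_xml_to_json(xml: str) -> str:
    """Convert a one-level XML record to compact JSON (illustrative only)."""
    root = ET.fromstring(xml)
    record = {child.tag: child.text for child in root}
    # Compact separators avoid the spaces json.dumps adds by default.
    return json.dumps({root.tag: record}, separators=(",", ":"))


if __name__ == "__main__":
    source = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
    compact = flat_xml_to_json(source)
    print(compact)
    print(len(source), "->", len(compact), "characters")
```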
Sorry, not in the spec. If you have a big XML file, you'd better compress it with zip, gzip, or the like.
Is there any reason you aren't using YAML or JSON?