Why is the GZip of ProtoBuf.NET bigger than the GZip of a tab-separated-values file?
We have recently compared the respective file sizes of the same tabular data (think a single table, half a dozen columns, describing a product catalog) serialized with ProtoBuf.NET or with TSV (tab-separated values), both files compressed with GZip afterward (the default .NET implementation).
I have been surprised to notice that the compressed ProtoBuf.NET version takes a lot more space than the text version (up to 3x more). My pet theory is that ProtoBuf does not respect byte
semantics and consequently mismatches the GZip frequency compression tree; hence a relatively inefficient compression.
Another possibility is that ProtoBuf actually encodes a lot more data (to facilitate schema versioning, for example), so the serialized formats are not strictly comparable information-wise.
Has anybody observed the same problem? Is it even worth compressing ProtoBuf?
There are a number of factors possible here; firstly, note that the protocol buffers wire format uses straight UTF-8 encoding for strings; if your data is dominated by strings, it will ultimately need about the same amount of space as it would for TSV.
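As a rough illustration of that point, here is a minimal protobuf-net sketch (the Row type and its values are made up for this example) comparing the wire size of a string-heavy record with the equivalent TSV line; on the wire, each string field costs only a tag byte and a varint length prefix on top of the raw UTF-8 bytes:

    using System;
    using System.IO;
    using System.Text;
    using ProtoBuf; // protobuf-net

    // Hypothetical string-heavy record, for illustration only.
    [ProtoContract]
    class Row
    {
        [ProtoMember(1)] public string Name { get; set; }
        [ProtoMember(2)] public string Description { get; set; }
    }

    class WireSizeCheck
    {
        static void Main()
        {
            var row = new Row { Name = "Widget", Description = "A fairly ordinary widget" };

            // protobuf-net: each string field = 1 tag byte + varint length + UTF-8 payload.
            long protoBytes;
            using (var ms = new MemoryStream())
            {
                Serializer.Serialize(ms, row);
                protoBytes = ms.Length;
            }

            // TSV: the same UTF-8 payload, separated by a tab and terminated by a newline.
            string tsvLine = row.Name + "\t" + row.Description + "\n";
            int tsvBytes = Encoding.UTF8.GetByteCount(tsvLine);

            Console.WriteLine("protobuf-net: {0} bytes, TSV: {1} bytes", protoBytes, tsvBytes);
        }
    }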
Protocol buffers is also designed to help store structured data, i.e. more complex models than the single-table scenario. This doesn't contribute hugely to the size, but start comparing with xml/json etc (which are more similar in terms of capability) and the difference is more obvious.
Additionally, since protocol buffers is pretty dense (UTF-8 notwithstanding), in some cases compressing it can actually make it bigger - you might want to check if this is the case here.
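A quick way to check is to gzip the already-serialized payload and compare the sizes; something along these lines (the file path is just a placeholder):

    using System;
    using System.IO;
    using System.IO.Compression;

    class CompressionCheck
    {
        static void Main()
        {
            // Placeholder path: point this at the serialized protobuf-net output.
            byte[] raw = File.ReadAllBytes("catalog.bin");

            using (var ms = new MemoryStream())
            {
                using (var gzip = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
                {
                    gzip.Write(raw, 0, raw.Length);
                }
                // If the gzipped size is not clearly smaller, compression isn't buying anything here.
                Console.WriteLine("raw: {0:n0} bytes, gzipped: {1:n0} bytes", raw.Length, ms.Length);
            }
        }
    }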
In a quick sample for the scenario you present, both formats give roughly the same sizes - there is no massive jump:
the tsv is marginally smaller in this case, but ultimately TSV is indeed a very simple format (with very limited capabilities in terms of structured data), so it is no surprise that it is quick.
Indeed; if all you are storing is a very simple single table, TSV is not a bad option - however, it is ultimately a very limited format. I can't reproduce your "much bigger" example.
In addition to the richer support for structured data (and other features), protobuf places a lot of emphasis on processing performance too. Now, since TSV is pretty simple, the edge here won't be massive (but is noticeable in the above); but again: contrast with xml, json, or the inbuilt BinaryFormatter for a test against formats with similar features, and the difference is obvious.
Example for the numbers above (updated to use BufferedStream):
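A minimal sketch along those lines, assuming a hypothetical Product type and made-up catalog rows (the field layout and row count are illustrative only):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using ProtoBuf; // protobuf-net

    // Hypothetical catalog row used only for illustration.
    [ProtoContract]
    public class Product
    {
        [ProtoMember(1)] public int Id { get; set; }
        [ProtoMember(2)] public string Name { get; set; }
        [ProtoMember(3)] public string Category { get; set; }
        [ProtoMember(4)] public decimal Price { get; set; }
    }

    static class SizeComparison
    {
        public static void Main()
        {
            var rows = new List<Product>();
            for (int i = 0; i < 100000; i++)
            {
                rows.Add(new Product
                {
                    Id = i,
                    Name = "Product " + i,
                    Category = "Category " + (i % 20),
                    Price = i * 0.01m
                });
            }

            Console.WriteLine("protobuf-net + GZip: {0:n0} bytes", ProtoGzip(rows));
            Console.WriteLine("TSV + GZip:          {0:n0} bytes", TsvGzip(rows));
        }

        static long ProtoGzip(List<Product> rows)
        {
            using (var ms = new MemoryStream())
            {
                using (var gzip = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
                using (var buffer = new BufferedStream(gzip, 64 * 1024))
                {
                    Serializer.Serialize(buffer, rows); // the list is written as a repeated message field
                }
                return ms.Length;
            }
        }

        static long TsvGzip(List<Product> rows)
        {
            using (var ms = new MemoryStream())
            {
                using (var gzip = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
                using (var buffer = new BufferedStream(gzip, 64 * 1024))
                using (var writer = new StreamWriter(buffer))
                {
                    foreach (var p in rows)
                        writer.WriteLine("{0}\t{1}\t{2}\t{3}", p.Id, p.Name, p.Category, p.Price);
                }
                return ms.Length;
            }
        }
    }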
GZip is a stream compressor. If you do not buffer data properly, the compression will be very poor, because it only operates on small blocks, resulting in much less effective compression.
Try putting a BufferedStream between your serializer and the GZipStream with a properly sized buffer.
Example: compressing the Int32 sequence 1..100,000 with a BinaryWriter writing directly to a GZipStream results in ~650kb, while putting a 64kb BufferedStream in between results in only ~340kb of compressed data.
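A minimal sketch of that comparison; how big the gap is depends on the GZipStream implementation in your runtime, so treat the exact sizes as indicative only:

    using System;
    using System.IO;
    using System.IO.Compression;

    class GZipBufferingDemo
    {
        static void Main()
        {
            Console.WriteLine("BinaryWriter -> GZipStream:                  {0:n0} bytes", Compress(useBuffer: false));
            Console.WriteLine("BinaryWriter -> 64kb BufferedStream -> GZip: {0:n0} bytes", Compress(useBuffer: true));
        }

        // Writes the Int32 sequence 1..100,000 through a GZipStream, optionally
        // with a 64kb BufferedStream between the writer and the compressor.
        static long Compress(bool useBuffer)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
                {
                    Stream target = useBuffer ? new BufferedStream(gzip, 64 * 1024) : (Stream)gzip;
                    using (var writer = new BinaryWriter(target))
                    {
                        for (int i = 1; i <= 100000; i++)
                            writer.Write(i); // 4 bytes per value, 400,000 bytes in total
                    }
                }
                return output.Length;
            }
        }
    }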