Why is GZipped ProtoBuf.NET larger than a GZipped tab-separated-values file?

We have recently compared the respective file sizes of the same tabular data (think single table, half a dozen columns, describing a product catalog) serialized with ProtoBuf.NET or with TSV (tab-separated data), both files compressed with GZip afterward (default .NET implementation).

I have been surprised to notice that the compressed ProtoBuf.NET version takes a lot more space than the text version (up to 3x more). My pet theory is that ProtoBuf does not respect byte semantics and consequently mismatches the GZip frequency-compression tree; hence the relatively inefficient compression.

Another possibility is that ProtoBuf encodes, in fact, a lot more data (to facilitate schema versioning for example), hence the serialized formats are not strictly comparable information-wise.

Anybody observing the same problem? Is it even worth compressing ProtoBuf?

2 Answers

素染倾城色 2024-12-20 05:33:11

There are a number of factors possible here; firstly, note that the protocol buffers wire format uses straight UTF-8 encoding for strings; if your data is dominated by strings, it will ultimately need about the same amount of space as it would for TSV.
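
To make that concrete, here is a minimal sketch (the NameOnly contract and WireSizeDemo class are mine, for illustration only): a string member is written as a one-byte field header plus a varint length prefix followed by the raw UTF-8 bytes, so a string-heavy record costs barely more than the text itself.

using System;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class NameOnly // hypothetical one-string contract
{
    [ProtoMember(1)]
    public string Name { get; set; }
}

static class WireSizeDemo
{
    static void Main()
    {
        var obj = new NameOnly { Name = "Lorem ipsum dolor sit amet" }; // 26 ASCII chars
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, obj);
            // 1 byte field header + 1 byte length varint + 26 UTF-8 bytes = 28 bytes,
            // barely more than the 26 bytes the same text would occupy in a TSV cell
            Console.WriteLine(ms.Length); // 28
        }
    }
}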

Protocol buffers is also designed to help store structured data, i.e. more complex models than the single-table scenario. This doesn't contribute hugely to the size, but start comparing with xml/json etc (which are more similar in terms of capability) and the difference is more obvious.

Additionally, since protocol buffers is pretty dense (UTF-8 notwithstanding), in some cases compressing it can actually make it bigger - you might want to check if this is the case here.
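
If you want to sanity-check that, a quick helper (a sketch of mine, not part of the benchmark below) is to compress the already-serialized bytes and compare lengths:

using System.IO;
using System.IO.Compression;

static class GZipCheck
{
    // Returns true only if GZip actually shrank the payload; dense data with
    // little redundancy can come out larger once the gzip header, trailer and
    // block overhead are added.
    public static bool GZipHelps(byte[] raw)
    {
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            } // dispose the gzip stream to flush the final block and trailer
            return ms.Length < raw.LongLength;
        }
    }
}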

In a quick sample for the scenario you present, both formats give roughly the same sizes - there is no massive jump:

protobuf-net, no compression: 2498720 bytes, write 34ms, read 72ms, chk 50000
protobuf-net, gzip: 1521215 bytes, write 234ms, read 146ms, chk 50000
tsv, no compression: 2492591 bytes, write 74ms, read 122ms, chk 50000
tsv, gzip: 1258500 bytes, write 238ms, read 169ms, chk 50000

the tsv is marginally smaller in this case, but ultimately TSV is indeed a very simple format (with very limited capabilities in terms of structured data), so it is no surprise that it is quick.

Indeed; if all you are storing is a very simple single table, TSV is not a bad option - however, it is ultimately a very limited format. I can't reproduce your "much bigger" example.

In addition to the richer support for structured data (and other features), protobuf places a lot of emphasis on processing performance too. Now, since TSV is pretty simple, the edge here won't be massive (but it is noticeable above); but again: contrast with xml, json, or the inbuilt BinaryFormatter for a test against formats with similar features, and the difference is obvious.


Example for the numbers above (updated to use BufferedStream):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Text;
using ProtoBuf;
static class Program
{
    static void Main()
    {
        RunTest(12345, 1, new StringWriter()); // let everyone JIT etc
        RunTest(12345, 50000, Console.Out); // actual test
        Console.WriteLine("(done)");
        Console.ReadLine();
    }
    static void RunTest(int seed, int count, TextWriter cout)
    {

        var data = InventData(seed, count);

        byte[] raw;
        Catalog catalog;
        var write = Stopwatch.StartNew();
        using(var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, data);
            raw = ms.ToArray();
        }
        write.Stop();

        var read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        {
            catalog = Serializer.Deserialize<Catalog>(ms);
        }
        read.Stop();

        cout.WriteLine("protobuf-net, no compression: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;

        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())   
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            using (var bs = new BufferedStream(gzip, 64 * 1024))
            {
                Serializer.Serialize(bs, data);
            } // need to close gzip to flush it (flush doesn't flush)
            raw = ms.ToArray();
        }
        write.Stop();

        read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        using(var gzip = new GZipStream(ms, CompressionMode.Decompress, true))
        {
            catalog = Serializer.Deserialize<Catalog>(gzip);
        }
        read.Stop();

        cout.WriteLine("protobuf-net, gzip: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;

        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms))
            {
                WriteTsv(data, writer);
            }
            raw = ms.ToArray();
        }
        write.Stop();

        read = Stopwatch.StartNew();
        using (var ms = new MemoryStream(raw))
        using (var reader = new StreamReader(ms))
        {
            catalog = ReadTsv(reader);
        }
        read.Stop();

        cout.WriteLine("tsv, no compression: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
        raw = null; catalog = null;

        write = Stopwatch.StartNew();
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            using(var bs = new BufferedStream(gzip, 64 * 1024))
            using(var writer = new StreamWriter(bs))
            {
                WriteTsv(data, writer);
            }
            raw = ms.ToArray();
        }
        write.Stop();

        read = Stopwatch.StartNew();
        using(var ms = new MemoryStream(raw))
        using(var gzip = new GZipStream(ms, CompressionMode.Decompress, true))
        using(var reader = new StreamReader(gzip))
        {
            catalog = ReadTsv(reader);
        }
        read.Stop();

        cout.WriteLine("tsv, gzip: {0} bytes, write {1}ms, read {2}ms, chk {3}", raw.Length, write.ElapsedMilliseconds, read.ElapsedMilliseconds, catalog.Products.Count);
    }

    private static Catalog ReadTsv(StreamReader reader)
    {
        string line;
        List<Product> list = new List<Product>();
        while((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split('\t');
            var row = new Product();
            row.Id = int.Parse(parts[0]);
            row.Name = parts[1];
            row.QuantityAvailable = int.Parse(parts[2]);
            row.Price = decimal.Parse(parts[3]);
            row.Weight = int.Parse(parts[4]);
            row.Sku = parts[5];
            list.Add(row);
        }
        return new Catalog {Products = list};
    }
    private static void WriteTsv(Catalog catalog, StreamWriter writer)
    {
        foreach (var row in catalog.Products)
        {
            writer.Write(row.Id);
            writer.Write('\t');
            writer.Write(row.Name);
            writer.Write('\t');
            writer.Write(row.QuantityAvailable);
            writer.Write('\t');
            writer.Write(row.Price);
            writer.Write('\t');
            writer.Write(row.Weight);
            writer.Write('\t');
            writer.Write(row.Sku);
            writer.WriteLine();
        }
    }
    static Catalog InventData(int seed, int count)
    {
        string[] lipsum =
            @"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
                .Split(' ');
        char[] skuChars = "0123456789abcdef".ToCharArray();
        Random rand = new Random(seed);
        var list = new List<Product>(count);
        int id = 0;
        for (int i = 0; i < count; i++)
        {
            var row = new Product();
            row.Id = id++;
            var name = new StringBuilder(lipsum[rand.Next(lipsum.Length)]);
            int wordCount = rand.Next(0,5);
            for (int j = 0; j < wordCount; j++)
            {
                name.Append(' ').Append(lipsum[rand.Next(lipsum.Length)]);
            }
            row.Name = name.ToString();
            row.QuantityAvailable = rand.Next(1000);
            row.Price = rand.Next(10000)/100M;
            row.Weight = rand.Next(100);
            char[] sku = new char[10];
            for(int j = 0 ; j < sku.Length ; j++)
                sku[j] = skuChars[rand.Next(skuChars.Length)];
            row.Sku = new string(sku);
            list.Add(row);
        }
        return new Catalog {Products = list};
    }
}
[ProtoContract]
public class Catalog
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<Product> Products { get; set; } 
}
[ProtoContract]
public class Product
{
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
    [ProtoMember(3)]
    public int QuantityAvailable { get; set;}
    [ProtoMember(4)]
    public decimal Price { get; set; }
    [ProtoMember(5)]
    public int Weight { get; set; }
    [ProtoMember(6)]
    public string Sku { get; set; }
}
一直在等你来 2024-12-20 05:33:11

GZip is a stream compressor. If you do not buffer data properly, the compression will be very poor, because it will only operate on small blocks, resulting in much less effective compression.

Try putting a BufferedStream between your serializer and the GZipStream with a properly sized buffer.

Example: compressing the Int32 sequence 1..100'000 with a BinaryWriter writing directly to a GZipStream results in ~650 KB, while putting a 64 KB BufferedStream in between yields only ~340 KB of compressed data.
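
A sketch of that experiment (the BufferDemo and CompressedSize names are mine; the ~650 KB and ~340 KB figures are from the answer above and will vary by framework version; on modern .NET the gap may be much smaller because GZipStream buffers more internally):

using System;
using System.IO;
using System.IO.Compression;

static class BufferDemo
{
    // Writes the Int32 sequence 1..100,000 through a BinaryWriter into a
    // GZipStream, optionally coalescing the many 4-byte writes with a
    // 64 KB BufferedStream first.
    static long CompressedSize(bool buffered)
    {
        using (var ms = new MemoryStream())
        {
            var gzip = new GZipStream(ms, CompressionMode.Compress, true); // leave ms open
            Stream target = buffered ? new BufferedStream(gzip, 64 * 1024) : (Stream)gzip;
            using (var writer = new BinaryWriter(target))
            {
                for (int i = 1; i <= 100000; i++)
                    writer.Write(i);
            } // disposing the writer flushes the buffer and the gzip trailer
            return ms.Length;
        }
    }

    static void Main()
    {
        Console.WriteLine("direct:   {0} bytes", CompressedSize(false)); // ~650 KB per the answer
        Console.WriteLine("buffered: {0} bytes", CompressedSize(true));  // ~340 KB per the answer
    }
}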
