Compact binary representation of JSON

Published 2024-10-16 02:40:55


Comments (8)

梦太阳 2024-10-23 02:40:55

You could take a look at the Universal Binary JSON specification. It won't be as compact as Smile because it doesn't do name references, but it is 100% compatible with JSON (whereas BSON and BJSON define data structures that don't exist in JSON, so there is no standard conversion to/from them).

It is also (intentionally) criminally simple to read and write with a standard format of:

[type, 1-byte char]([length, 4-byte int32])([data])

So simple data types begin with an ASCII marker code like 'I' for a 32-bit int, 'T' for true, 'Z' for null, 'S' for string and so on.

The format is engineered to be fast to read, as all data structures are prefixed with their size, so there is no scanning for null-terminated sequences.

For example, consider reading a string that might be demarcated like this (the [] characters are just for illustration; they are not written in the format):

[S][512][this is a really long 512-byte UTF-8 string....]

You would see the 'S', switch on it to process a string, read the 4-byte integer "512" that follows it, and know that you can grab the next 512 bytes in one chunk and decode them back to a string.
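The read logic described above can be sketched in Python (a hypothetical helper, not the reference Java implementation; the length is read big-endian, as the spec mandates):

```python
import struct

def read_string(buf: bytes, offset: int = 0) -> tuple[str, int]:
    """Read one [S][int32 length][UTF-8 data] record; return (value, next offset)."""
    marker = buf[offset:offset + 1]
    if marker != b"S":
        raise ValueError(f"expected string marker 'S', got {marker!r}")
    # Length is a 4-byte big-endian int32, so no scanning for terminators is needed.
    (length,) = struct.unpack_from(">i", buf, offset + 1)
    start = offset + 5
    data = buf[start:start + length]
    return data.decode("utf-8"), start + length

record = b"S" + struct.pack(">i", 5) + b"hello"
value, _ = read_string(record)  # value == "hello"
```

The marker byte drives a simple switch, and the explicit length means the data can be consumed in a single slice.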

Similarly, numeric values are written out without a length value to be more compact, because their types (byte, int32, int64, double) all define their length in bytes (1, 4, 8 and 8 respectively). There is also support for arbitrarily long numbers that is extremely portable, even on platforms that don't support them.
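Writing fixed-width numbers is correspondingly simple, since the marker alone fixes the width; a minimal sketch (the 'I' marker for int32 is stated above, the 'D' marker for doubles is assumed from the spec):

```python
import struct

def write_int32(value: int) -> bytes:
    # Marker 'I' plus 4 big-endian bytes; no length field is needed
    # because the type itself fixes the width.
    return b"I" + struct.pack(">i", value)

def write_double(value: float) -> bytes:
    # IEEE 754 double: marker 'D' plus 8 big-endian bytes.
    return b"D" + struct.pack(">d", value)

assert len(write_int32(42)) == 5      # 1 marker byte + 4 data bytes
assert len(write_double(3.14)) == 9   # 1 marker byte + 8 data bytes
```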

On average you should see a size reduction of roughly 30% with a well balanced JSON object (lots of mixed types). If you want to know exactly how certain structures compress or don't compress you can check the Size Requirements section to get an idea.

On the bright side, regardless of compression, the data will be written in a more optimized format and be faster to work with.

I checked in the core Input/OutputStream implementations for reading/writing the format to GitHub today. I'll check in the general reflection-based object mapping later this week.

You can just look at those two classes to see how to read and write the format; I think the core logic is something like 20 lines of code. The classes are longer because of abstraction of the methods and some structure around checking the marker bytes to make sure the data file is in a valid format; things like that.

If you have really specific questions, like the endianness of the spec (big-endian) or the numeric format for doubles (IEEE 754), all of that is covered in the spec doc, or just ask me.

Hope that helps!

呆° 2024-10-23 02:40:55

Yes: the Smile data format (see the Wikipedia entry). It has a public Java implementation, with a C version in the works at github (libsmile). It has the benefit of being reliably more compact than JSON while keeping a 100% compatible logical data model, so it is easy to convert back and forth with textual JSON.

For performance, you can see the jvm-serializers benchmark, where Smile competes well with other binary formats (Thrift, Avro, protobuf); size-wise it is not the most compact (since it does retain field names), but it does much better with data streams where names are repeated.

It is being used by projects like Elastic Search and Solr (optionally), and Protostuff-rpc supports it, although it is not as widely used as, say, Thrift or protobuf.

EDIT (Dec 2011) -- there are now also libsmile bindings for PHP, Ruby and Python, so language support is improving. In addition, there are measurements on data size; although alternatives (Avro, protobuf) are more compact for single-record data, Smile is often more compact for data streams due to its key and string value back-reference options.
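The back-referencing idea that makes Smile compact on streams can be illustrated with a toy encoder (this is not Smile's actual wire format, just the general technique): the first time a key appears it is written in full and added to a table; later occurrences are replaced by a small index.

```python
def encode_keys(records):
    """Toy illustration of key back-references: emit ('def', name) the first
    time a key appears, ('ref', index) on every later occurrence."""
    table = {}
    out = []
    for record in records:
        for key in record:
            if key in table:
                out.append(("ref", table[key]))   # repeated key: small index only
            else:
                table[key] = len(table)
                out.append(("def", key))          # first sighting: full name
    return out

stream = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
tokens = encode_keys(stream)
# The first record defines both keys; the second refers back by index.
```

In a long stream of same-shaped records, almost every key is a one-index reference instead of a full name, which is where the size win comes from.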

花开柳相依 2024-10-23 02:40:55

Gzipping JSON data is going to get you good compression ratios with very little effort because of gzip's universal support. Also, if you're in a browser environment, you may end up paying a greater byte cost in the size of a new library dependency than you would save in actual payload.
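The "very little effort" point is easy to demonstrate with the standard library alone; a JSON payload with repetitive keys compresses well and round-trips losslessly:

```python
import gzip
import json

payload = [{"id": i, "status": "active"} for i in range(100)]
raw = json.dumps(payload).encode("utf-8")
packed = gzip.compress(raw)

# Redundant field names compress very well, and the round trip is lossless.
assert json.loads(gzip.decompress(packed)) == payload
assert len(packed) < len(raw)
```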

If your data has additional constraints (such as lots of redundant field values), you may be able to optimize by looking at a different serialization protocol rather than sticking to JSON. Example: a column-based serialization such as Avro's upcoming columnar store may get you better ratios (for on-disk storage). If your payloads contain lots of constant values (such as columns that represent enums), a dictionary compression approach may be useful too.
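The dictionary-compression approach for enum-like columns mentioned above can be sketched as follows (a hypothetical helper for illustration): repeated string values are replaced by small integer codes plus a lookup table.

```python
def dict_compress(column):
    """Replace repeated string values (e.g. an enum-like column) with small
    integer codes plus a value table."""
    table = []
    index = {}
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(table)
            table.append(value)       # each distinct value stored once
        codes.append(index[value])    # every occurrence becomes a small int
    return table, codes

def dict_decompress(table, codes):
    return [table[c] for c in codes]

column = ["RED", "GREEN", "RED", "RED", "GREEN"]
table, codes = dict_compress(column)
assert dict_decompress(table, codes) == column
```

With few distinct values and many rows, the per-row cost drops from a full string to a single small code.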

故事与诗 2024-10-23 02:40:55

Another alternative that should be considered these days is CBOR (RFC 7049), which has an explicitly JSON-compatible model with a lot of flexibility. It is both stable and meets your open-standard qualification, and has obviously had a lot of thought put into it.
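To give a taste of how compact CBOR's model is: the initial byte of every data item packs a major type and a small count (per RFC 7049). A tiny hand-rolled encoder, as a sketch covering only short text keys and unsigned ints 0-23 (real code should use a CBOR library):

```python
def cbor_small_map(pairs):
    """Encode a map of short text keys -> small unsigned ints (0-23 only).
    Each initial byte is (major_type << 5) | count, per RFC 7049."""
    out = bytearray([(5 << 5) | len(pairs)])        # major type 5: map header
    for key, value in pairs.items():
        assert len(key) < 24 and 0 <= value <= 23   # sketch handles small cases only
        out.append((3 << 5) | len(key))             # major type 3: text string
        out += key.encode("utf-8")
        out.append(value)                           # major type 0: unsigned int
    return bytes(out)

assert cbor_small_map({"a": 1}) == b"\xa1aa\x01"    # {"a": 1} in 4 bytes
```

The same object costs 8 bytes as textual JSON (`{"a": 1}` minus whitespace is `{"a":1}`), so even this trivial case shows the saving.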

戈亓 2024-10-23 02:40:55

Have you tried BJSON?

烧了回忆取暖 2024-10-23 02:40:55

Try using js-inflate to create and decompress blobs.

https://github.com/augustl/js-inflate

This works perfectly, and I use it a lot.

梦与时光遇 2024-10-23 02:40:55

You might also want to take a look at a library I wrote. It's called minijson, and it was designed for this very purpose.
It's Python:

https://github.com/Dronehub/minijson

彩扇题诗 2024-10-23 02:40:55

Have you tried AVRO? Apache Avro
https://avro.apache.org/
