Efficient binary serialization for Clojure/Java
I'm looking for a way to efficiently serialize Clojure objects into a binary format - i.e. not just doing the classic print and read text serialization.
i.e. I want to do something like:
(def orig-data {:name "Data Object"
                :data (get-big-java-array)
                :other (get-clojure-data-stuff)})
(def binary (serialize orig-data))
;; here "binary" is a raw binary form, e.g. a Java byte array
;; so it can be persisted in key/value store or sent over network etc.
;; now check it works!
(def new-data (deserialize binary))
(= new-data orig-data)
=> true
The motivation is that I have some large data structures that contain a significant amount of binary data (in Java arrays), and I want to avoid the overhead of converting these all to text and back again. In addition, I'm trying to keep the format compact in order to minimise network bandwidth usage.
Specific features I'd like to have:
- Lightweight, pure-Java implementation
- Support all of Clojure's standard data structures as well as all Java primitives, arrays etc.
- No need for extra build steps / configuration files - I'd rather it just worked "out of the box"
- Good performance in terms of processing time required
- Compactness in terms of binary encoded representation
What's the best / standard approach to doing this in Clojure?
Comments (4)
I may be missing something here, but what's wrong with the standard Java serialization? Too slow, too big, something else?
A Clojure wrapper for plain Java serialization could be something like this:
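A minimal sketch of such a wrapper, assuming nothing beyond the standard java.io object streams; the names serialize and deserialize simply mirror the question's usage and are not taken from any library:

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream
                  ObjectInputStream ObjectOutputStream))

(defn serialize
  "Serializes value v with plain Java serialization; returns a byte array."
  [v]
  (let [buffer (ByteArrayOutputStream.)]
    ;; with-open closes (and therefore flushes) the object stream
    ;; before the bytes are copied out of the buffer.
    (with-open [out (ObjectOutputStream. buffer)]
      (.writeObject out v))
    (.toByteArray buffer)))

(defn deserialize
  "Accepts a byte array produced by serialize; returns the deserialized value."
  [^bytes data]
  (with-open [in (ObjectInputStream. (ByteArrayInputStream. data))]
    (.readObject in)))
```

Anything that does not implement java.io.Serializable will make serialize throw a java.io.NotSerializableException, so it may be worth checking (instance? java.io.Serializable v) first.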
There are values that cannot be serialized, e.g. Java streams and Clojure atom/agent/future, but it should work for most plain values, including Java primitives and arrays and Clojure functions, collections and records.
Whether you actually save anything depends. In my limited testing on smallish data sets serializing to text and binary seems to be about the same time and space.
But for the special case where the bulk of the data is arrays of Java primitives, Java serialization can be orders of magnitude faster and save a significant chunk of space. (Quick test on a laptop, 100k random bytes: serialize 0.9 ms, 100kB; text 490 ms, 700kB.)
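A rough way to repeat that kind of comparison is sketched below. It is an illustration only: the helper names are made up here, and it assumes the text form is the array printed as a vector of numbers via pr-str, which may not be how the figures above were obtained. Timings from time are indicative at best, since nothing is warmed up.

```clojure
(import '(java.io ByteArrayOutputStream ObjectOutputStream))

(defn java-serialized-size
  "Number of bytes produced by plain Java serialization of v."
  [v]
  (let [buffer (ByteArrayOutputStream.)]
    (with-open [out (ObjectOutputStream. buffer)]
      (.writeObject out v))
    (.size buffer)))

(defn text-size
  "Number of characters in the printed form of the array contents.
  ASSUMPTION: the array is printed as a vector of numbers."
  [arr]
  (count (pr-str (vec arr))))

(defn random-bytes
  "Returns a byte array of length n filled with random bytes."
  [n]
  (let [arr (byte-array n)]
    (.nextBytes (java.util.Random.) arr)
    arr))

(let [arr (random-bytes 100000)]
  (println "Java serialization size:" (time (java-serialized-size arr)))
  (println "Printed text size:      " (time (text-size arr))))
```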
Note that the (= new-data orig-data) test doesn't work for arrays (it delegates to Java's equals, which for arrays just tests whether it's the same object), so you may want/need to write your own equality function to test the serialization.
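For testing the round trip, an equality check that looks inside arrays could be sketched like this. The names array? and deep= are hypothetical helpers, not part of clojure.core, and the sketch only descends into arrays, maps (values only) and sequential collections; arrays used as map keys or as set members still fall back to plain =.

```clojure
(defn array?
  "True if x is a Java array of any element type."
  [x]
  (boolean (some-> x class .isArray)))

(defn deep=
  "Like =, but compares Java arrays element by element."
  [a b]
  (cond
    ;; Arrays: same array class, same length, pairwise deep= elements.
    (and (array? a) (array? b))
    (and (= (class a) (class b))
         (= (count a) (count b))
         (every? true? (map deep= a b)))

    ;; Maps: same key set, deep= on each pair of values.
    (and (map? a) (map? b))
    (and (= (set (keys a)) (set (keys b)))
         (every? (fn [k] (deep= (get a k) (get b k))) (keys a)))

    ;; Vectors, lists and seqs: same length, pairwise deep= elements.
    (and (sequential? a) (sequential? b))
    (and (= (count a) (count b))
         (every? true? (map deep= a b)))

    ;; Everything else: ordinary Clojure equality.
    :else (= a b)))
```

With this, (deep= new-data orig-data) can replace the plain = check when the data contains arrays.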
Nippy is one of the best choices imho: https://github.com/ptaoussanis/nippy
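A minimal usage sketch, assuming Nippy is already on the classpath (it is published on Clojars as com.taoensso/nippy); freeze and thaw are its basic entry points, and the small byte array here is only a stand-in for the question's big Java array:

```clojure
;; ASSUMPTION: the Nippy dependency has been added to the project.
(require '[taoensso.nippy :as nippy])

(def orig-data {:name "Data Object"
                :data (byte-array [1 2 3])   ; stand-in for the big Java array
                :other [1 2.5 "three" #{:a :b}]})

;; freeze returns a byte array; thaw turns it back into Clojure data.
(def binary (nippy/freeze orig-data))
(def new-data (nippy/thaw binary))
```

As with Java serialization, the thawed :data value is a different array object, so compare its contents (for example via seq) rather than relying on = alone.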
Have you considered Google's protobuf? You might want to check the GitHub repository with the interface for Clojure.
If you don't have a schema ahead of time, serializing to text is probably your best bet. To serialize arbitrary data in general, you need to do a lot of work to preserve the object graph, and do reflection to see how to serialize everything... at least Clojure's printer can do a static, no-reflection lookup of the print-method for each item.
Conversely, if you really want an optimized wire format, you need to define a schema. I've used Thrift from Java, and protobuf from Clojure: neither is loads of fun, but it's not hideously onerous if you plan in advance.
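For reference, the schema-free text route can be sketched as below. The function names are hypothetical, and reading back with clojure.edn/read-string (rather than clojure.core/read-string) is a choice made here for safety with untrusted input. Java arrays do not round-trip this way, because the printer does not emit their contents.

```clojure
(require '[clojure.edn :as edn])

(defn text-serialize
  "Prints v to a string; the printer dispatches on print-method per item."
  [v]
  (pr-str v))

(defn text-deserialize
  "Reads a single value back from a string produced by text-serialize."
  [s]
  (edn/read-string s))

(def sample {:name "Data Object"
             :other [1 2.5 "three" #{:a :b} '(x y)]})

(def round-tripped (text-deserialize (text-serialize sample)))
```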