Efficient binary serialization for Clojure/Java

Posted 2024-12-09 06:26:15

I'm looking for a way to efficiently serialize Clojure objects into a binary format - i.e. not just doing the classic print and read text serialization.

Specifically, I want to do something like:

(def orig-data {:name "Data Object" 
                :data (get-big-java-array) 
                :other (get-clojure-data-stuff)})

(def binary (serialize orig-data))

;; here "binary" is a raw binary form, e.g. a Java byte array
;; so it can be persisted in key/value store or sent over network etc.

;; now check it works!

(def new-data (deserialize binary))

(= new-data orig-data)
=> true

The motivation is that I have some large data structures that contain a significant amount of binary data (in Java arrays), and I want to avoid the overhead of converting these all to text and back again. In addition, I'm trying to keep the format compact in order to minimise network bandwidth usage.

Specific features I'd like to have:

Lightweight, pure-Java implementation
Support all of Clojure's standard data structures as well as all Java primitives, arrays etc.
No need for extra build steps / configuration files - I'd rather it just worked "out of the box"
Good performance in terms of processing time required
Compactness in terms of binary encoded representation

What's the best / standard approach to doing this in Clojure?


脸赞 2024-12-16 06:26:15

I may be missing something here, but what's wrong with the standard Java serialization? Too slow, too big, something else?

A Clojure wrapper for plain Java serialization could be something like this:

(defn serializable? [v]
  (instance? java.io.Serializable v))

(defn serialize 
  "Serializes value, returns a byte array"
  [v]
  (let [buff (java.io.ByteArrayOutputStream. 1024)]
    (with-open [dos (java.io.ObjectOutputStream. buff)]
      (.writeObject dos v))
    (.toByteArray buff)))

(defn deserialize 
  "Accepts a byte array, returns deserialized value"
  [bytes]
  (with-open [dis (java.io.ObjectInputStream.
                   (java.io.ByteArrayInputStream. bytes))]
    (.readObject dis)))

user> (= (range 10) (deserialize (serialize (range 10))))
true

There are values that cannot be serialized, e.g. Java streams and Clojure atom/agent/future, but it should work for most plain values, including Java primitives and arrays and Clojure functions, collections and records.
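One caveat worth illustrating: a check like serializable? only inspects the outermost value, so a collection that holds an atom still passes the check and only fails when it is actually written. The sketch below is a minimal illustration of that; try-serialize is a hypothetical helper name introduced here, not part of any library.

```clojure
(import '(java.io ByteArrayOutputStream ObjectOutputStream NotSerializableException))

(defn serializable? [v]
  (instance? java.io.Serializable v))

(defn try-serialize
  "Hypothetical helper: returns the serialized byte array, or nil if
  some nested value turns out not to be serializable."
  [v]
  (try
    (let [buff (ByteArrayOutputStream.)]
      (with-open [oos (ObjectOutputStream. buff)]
        (.writeObject oos v))
      (.toByteArray buff))
    (catch NotSerializableException _ nil)))

(serializable? {:a 1})             ;; => true
(serializable? (atom 1))           ;; => false
(serializable? {:state (atom 1)})  ;; => true, only the map itself is inspected
(try-serialize {:state (atom 1)})  ;; => nil, the nested atom is rejected at write time
```

In practice this means dereferencing atoms, agents and futures into plain values before serializing.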

Whether you actually save anything depends on the data. In my limited testing on smallish data sets, serializing to text and to binary seem to take about the same time and space.

But for the special case where the bulk of the data is arrays of Java primitives, Java serialization can be orders of magnitude faster and save a significant chunk of space. (Quick test on a laptop, 100k random bytes: serialize 0.9 ms, 100kB; text 490 ms, 700kB.)
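A comparison along these lines can be sketched as follows. This is not the exact test behind the numbers above: the text path is an assumption (a byte array has no readable printed form, so it is printed as a vector and rebuilt with byte-array), and timings from a single call to time are only a rough indication.

```clojure
(import '(java.io ByteArrayOutputStream ObjectOutputStream))

(defn serialize [v]
  (let [buff (ByteArrayOutputStream. 1024)]
    (with-open [oos (ObjectOutputStream. buff)]
      (.writeObject oos v))
    (.toByteArray buff)))

;; 100k random bytes
(def data
  (let [arr (byte-array 100000)]
    (.nextBytes (java.util.Random.) arr)
    arr))

;; Binary path: Java serialization of the array itself
(def binary (time (serialize data)))

;; Text path (assumed): print the array as a vector of numbers
(def text (time (pr-str (vec data))))

;; Reading the text back and rebuilding the byte array
(def restored (time (byte-array (read-string text))))

(println "binary size:" (count binary) "bytes")
(println "text size:  " (count text) "chars")
```

The absolute numbers will vary with the JVM, warm-up and hardware, so a proper benchmark would repeat each step many times.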

Note that the (= new-data orig-data) test doesn't work for arrays (it delegates to Java's equals, which for arrays just tests whether it's the same object), so you may want/need to write your own equality function to test the serialization.

user> (def a (range 10))
user> (= a (range 10))
true
user> (= (into-array a) (into-array a))
false
user> (.equals (into-array a) (into-array a))
false
user> (java.util.Arrays/equals (into-array a) (into-array a))
true
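For testing a round trip on data like the map in the question, one possible equality function is sketched below. It is a minimal version under stated assumptions: array? and deep= are hypothetical names, only maps, sequential collections and arrays are walked, and everything else falls back to =.

```clojure
(defn array? [x]
  (and (some? x) (.isArray (class x))))

(defn deep=
  "Like =, but compares Java arrays element by element."
  [a b]
  (cond
    ;; Arrays: same array type, same length, same elements
    (and (array? a) (array? b))
    (and (= (class a) (class b))
         (= (count a) (count b))
         (every? true? (map deep= a b)))

    ;; Maps: same keys, values compared recursively
    (and (map? a) (map? b))
    (and (= (set (keys a)) (set (keys b)))
         (every? (fn [k] (deep= (get a k) (get b k))) (keys a)))

    ;; Vectors, lists and seqs: compared position by position
    (and (sequential? a) (sequential? b))
    (and (= (count a) (count b))
         (every? true? (map deep= a b)))

    ;; Everything else: ordinary Clojure equality
    :else (= a b)))

(deep= {:name "Data Object" :data (byte-array [1 2 3])}
       {:name "Data Object" :data (byte-array [1 2 3])})
;; => true
```

For large primitive arrays, a type-hinted call to java.util.Arrays/equals would avoid boxing every element; the version above favors brevity over speed.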
