Java: assigning object reference IDs for custom serialization

Published 2024-09-05 07:08:15


For various reasons I have a custom serialization where I am dumping some fairly simple objects to a data file. There are maybe 5-10 classes, and the object graphs that result are acyclic and pretty simple (each serialized object has 1 or 2 references to another that are serialized). For example:

class Foo
{
    final private long id;
    public Foo(long id, /* other stuff */) { ... }
}

class Bar
{
    final private long id;
    final private Foo foo;
    public Bar(long id, Foo foo, /* other stuff */) { ... }
}

class Baz
{
    final private long id;
    final private List<Bar> barList;
    public Baz(long id, List<Bar> barList, /* other stuff */) { ... }
}

The id field is just for the serialization, so that when I am serializing to a file, I can write objects by keeping a record of which IDs have been serialized so far, then for each object checking whether its child objects have been serialized and writing the ones that haven't, finally writing the object itself by writing its data fields and the IDs corresponding to its child objects.
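
For concreteness, the scheme described above can be sketched as follows. This is a sketch under assumed names: `Node` and the in-memory "file" list are illustrative stand-ins, not the poster's real classes.

```java
import java.util.*;

public class WriteSketch {
    // Illustrative stand-in for a serializable object with child references.
    record Node(long id, List<Node> children) {}

    static final Set<Long> written = new HashSet<>();   // ids serialized so far
    static final List<String> file = new ArrayList<>(); // stands in for the data file

    static void write(Node n) {
        if (!written.add(n.id())) return;         // already serialized: skip
        for (Node c : n.children()) write(c);     // children go out first (graph is acyclic)
        List<String> childIds = new ArrayList<>();
        for (Node c : n.children()) childIds.add(Long.toString(c.id()));
        // the record is the object's own data plus the ids of its children
        file.add("id=" + n.id() + " children=[" + String.join(",", childIds) + "]");
    }

    public static void main(String[] args) {
        Node foo = new Node(1, List.of());
        Node bar = new Node(2, List.of(foo));
        write(bar);   // writes foo's record first, then bar's
        write(bar);   // no-op: id 2 is already recorded
        System.out.println(file);
    }
}
```

Because the graph is acyclic, marking an object as written before recursing into its children is safe, and each object's record still lands in the file after all of its children's records.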

What's puzzling me is how to assign id's. I thought about it, and it seems like there are three cases for assigning an ID:

  • dynamically-created objects -- id is assigned from a counter that increments
  • reading objects from disk -- id is assigned from the number stored in the disk file
  • singleton objects -- object is created prior to any dynamically-created object, to represent a singleton object that is always present.
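
One arrangement that covers all three cases (an assumed convention, not something stated in the question) is to reserve a low id range for singletons and keep the dynamic counter strictly above every id handed out or read so far:

```java
public class IdSource {
    // Ids below this are reserved for singletons (assumed convention).
    static final long FIRST_DYNAMIC_ID = 1000;

    private long next = FIRST_DYNAMIC_ID;

    long fresh() {                               // case 1: dynamically created object
        return next++;
    }

    long fromDisk(long storedId) {               // case 2: object read back from the file
        if (storedId >= next) next = storedId + 1;   // never reissue an id seen on disk
        return storedId;
    }

    // case 3: singletons use fixed ids below FIRST_DYNAMIC_ID, known up front

    public static void main(String[] args) {
        IdSource ids = new IdSource();
        System.out.println(ids.fresh());         // 1000
        System.out.println(ids.fromDisk(5000));  // 5000
        System.out.println(ids.fresh());         // 5001: counter skipped past the loaded id
    }
}
```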

How can I handle these properly? I feel like I'm reinventing the wheel and there must be a well-established technique for handling all the cases.


Clarification: just as some tangential information, the file format I am looking at is approximately the following (glossing over a few details which should not be relevant). It's optimized to handle a fairly large amount of dense binary data (tens/hundreds of MB) with the ability to intersperse structured data in it. The dense binary data makes up 99.9% of the file size.

The file consists of a series of error-corrected blocks which serve as containers. Each block can be thought of as containing a byte array which consists of a series of packets. It is possible to read the packets one at a time in succession (e.g. it's possible to tell where the end of each packet is, and the next one starts immediately afterwards).

So the file can be thought of as a series of packets stored on top of an error-correcting layer. The vast majority of these packets are opaque binary data that has nothing to do with this question. A small minority of these packets, however, are items containing serialized structured data, forming a sort of "archipelago" consisting of data "islands" which may be linked by object reference relationships.

So I might have a file where packet 2971 contains a serialized Foo, and packet 12083 contains a serialized Bar that refers to the Foo in packet 2971. (with packets 0-2970 and 2972-12082 being opaque data packets)

All these packets are immutable (and therefore, given the constraints of Java object construction, they form an acyclic object graph) so I don't have to deal with mutability issues. They are also descendants of a common Item interface. What I would like to do is write an arbitrary Item object to the file. If the Item contains references to other Items, I need to write those to the file too, but only if they haven't been written yet. Otherwise I will have duplicates that I will need to somehow coalesce when I read them back.
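
The coalescing mentioned at the end can be as simple as an id-keyed registry consulted during read-back. This is a sketch; `Item` and `Foo` here are stand-ins for the question's types:

```java
import java.util.*;
import java.util.function.LongFunction;

public class ItemRegistry {
    interface Item { long id(); }   // stand-in for the question's common interface

    private final Map<Long, Item> byId = new HashMap<>();

    // Return the instance already read for this id, or construct and cache it.
    Item resolve(long id, LongFunction<Item> construct) {
        return byId.computeIfAbsent(id, k -> construct.apply(k));
    }

    public static void main(String[] args) {
        record Foo(long id) implements Item {}
        ItemRegistry reg = new ItemRegistry();
        Item first = reg.resolve(2971, Foo::new);
        Item second = reg.resolve(2971, Foo::new);   // duplicate reference: coalesced
        System.out.println(first == second);         // one shared instance
    }
}
```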

3 Answers

林空鹿饮溪 2024-09-12 07:08:15


Do you really need to do this? Internally, the ObjectOutputStream tracks which objects have been serialized already. Subsequent writes of the same object only store an internal reference (similar to writing out just the id) rather than writing out the whole object again.

See Serialization Cache for more details.
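
A quick demonstration of that back-reference behaviour (an `int[]` is used here only because any Serializable object will do):

```java
import java.io.*;

public class BackRefDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int[] payload = {1, 2, 3};          // any Serializable object works here
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(payload);
            oos.writeObject(payload);       // stored as a back-reference, not a second copy
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            Object a = ois.readObject();
            Object b = ois.readObject();
            System.out.println(a == b);     // both reads resolve to the same instance
        }
    }
}
```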

If the IDs correspond to some externally defined identity, such as an entity ID, then this makes sense. But the question states that the IDs are generated purely to track which objects are serialized.

You can handle singletons via the readResolve method. A simple approach is to compare the freshly deserialized instance with your singleton instances, and if there is a match, return the singleton instance rather than the deserialized instance. E.g.

   private Object readResolve() {
      return (this.equals(SINGLETON)) ? SINGLETON : this;
      // or simply
      // return SINGLETON;
   }

EDIT: In response to the comments, the stream is mostly binary data (stored in an optimized format) with complex objects interspersed in that data. This can be handled by using a stream format that supports substreams, e.g. zip, or a simple block chunking. E.g. the stream can be a sequence of blocks:

offset 0  - block type
offset 4  - block length N
offset 8  - N bytes of data
...
offset N+8  start of next block

You can then have blocks for binary data, blocks for serialized data, blocks for XStream serialized data etc. Since each block knows its size, you can create a substream to read up to that length from that place in the file. This allows you to freely mix data without concerns for parsing.

To implement a stream, have your main stream parse the blocks, e.g.

   DataInputStream main = new DataInputStream(input);
   int blockType = main.readInt();
   int blockLength = main.readInt();
   // next N bytes are the data
   LimitInputStream data = new LimitInputStream(main, blockLength);

   if (blockType==BINARY) {
      handleBinaryBlock(new DataInputStream(data));
   }
   else if (blockType==OBJECTSTREAM) {
      deserialize(new ObjectInputStream(data));
   }
   else
      ...

A sketch of LimitInputStream looks like this:

import java.io.*;

public class LimitInputStream extends FilterInputStream
{
   private int bytesRead;
   private int limit;
   /** Reads up to limit bytes from in */
   public LimitInputStream(InputStream in, int limit) {
      super(in);
      this.limit = limit;
   }

   public int read(byte[] data, int offs, int len) throws IOException {
      if (len==0) return 0; // read() contract mandates this
      if (bytesRead==limit)
         return -1;
      int toRead = Math.min(limit-bytesRead, len);
      int actuallyRead = super.read(data, offs, toRead);
      if (actuallyRead==-1)
          throw new EOFException("unexpected end of block");
      bytesRead += actuallyRead;
      return actuallyRead;
   }

   // similarly for the other read() methods

   // don't propagate to underlying stream
   public void close() { }
}
山色无中 2024-09-12 07:08:15


Are the foos registered with a FooRegistry? You could try this approach (assume Bar and Baz also have registries to acquire the references via the id).

This is only a sketch, but I feel the approach is a good one.

public class Foo {

    final private long id;

    public Foo(/* other fields */) {
        // construct
        this.id = FooRegistry.register(this);
    }

    public Foo(long id /*, other fields */) {
        // construct
        this.id = id;
        FooRegistry.register(this, id);
    }
}

public class FooRegistry {

    private static final Map<Long, Foo> foos = new HashMap<>();
    private static long currentFooCount;

    static long register(Foo foo) {
        while (foos.get(currentFooCount) != null) currentFooCount++; // skip ids already taken
        foos.put(currentFooCount, foo);
        return currentFooCount;
    }

    static void register(Foo foo, long id) {
        if (foos.get(id) != null) throw new IllegalStateException("id already registered");
        foos.put(id, foo);
    }
}

public class Bar {

    void writeToStream(PrintStream out) {
        out.print("<BAR><id>" + id + "</id><foo>" + foo.getId() + "</foo></BAR>");
    }
}

public class Baz {

    void writeToStream(PrintStream out) {
        out.print("<BAZ><id>" + id + "</id>");
        for (Bar bar : barList) out.print("<bar>" + bar.getId() + "</bar>");
        out.print("</BAZ>");
    }
}

奢华的一滴泪 2024-09-12 07:08:15


I feel like I'm reinventing the wheel and there must be a well-established technique for handling all the cases.

Yes, it looks like default object serialization would do; otherwise you may be prematurely optimizing.

You can change the format of the serialized data (as the XMLEncoder does) to a more convenient one.

But if you insist, I think the singleton with a dynamic counter should do; just don't put the id in the constructor's public interface:

class Foo {
    private final int id;
    public Foo( int id, /*other*/ ) { // drop the int id
    }
 }

So the class could be a "sequence", and a long would probably be more appropriate to avoid problems with Integer.MAX_VALUE.

Using an AtomicLong from the java.util.concurrent.atomic package (to avoid two threads assigning the same id, and to avoid excessive synchronization) would help too.

class Sequencer {
    private static AtomicLong sequenceNumber = new AtomicLong(0);
    public static long next() { 
         return sequenceNumber.getAndIncrement();
    }
}

Now in each class you have

 class Foo {
      private final long id;
      public Foo( String name, String data, etc ) {
          this.id = Sequencer.next();
      }
 }

And that's it.

( note, I don't remember if deserializing the object invokes the constructor, but you get the idea )
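
On that parenthetical: default deserialization does not invoke the serializable class's constructor (only the no-arg constructor of the nearest non-serializable superclass runs). A quick probe confirms it:

```java
import java.io.*;

public class CtorProbe {
    static class Probe implements Serializable {
        static int ctorCalls = 0;
        Probe() { ctorCalls++; }
    }

    public static void main(String[] args) throws Exception {
        Probe p = new Probe();                              // ctorCalls -> 1
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(p);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            ois.readObject();                               // Probe's constructor is NOT run here
        }
        System.out.println("ctorCalls = " + Probe.ctorCalls);
    }
}
```

So a `Sequencer`-based id in the constructor is only assigned for dynamically created objects; objects coming off disk would need their id restored by the deserialization path instead.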
