Serialization of objects: no thread state can be involved, right?
I am looking hard at the basic principles of storing the state of an executing program to disk, and bringing it back in again. In the current design that we have, each object (which is a C-level thingy with function pointer lists, kind of low-level home-made object-orientation -- and there are very good reasons for doing it this way) will be called to export its explicit state to a writable and restorable format. The key property to make this work is that all state related to an object is indeed encapsulated in the object data structures.
There are other solutions where you work with active objects, where there is a user-level thread attached to some objects. And thus, the program counter, register contents, and stack contents suddenly become part of the program state. As far as I can see, there is no good way to serialize such things to disk at an arbitrary point in time. The threads have to go park themselves in some special state where nothing is represented by the program counter et al, and thus basically "save" their execution state machine state to the explicit object state.
I have looked at a range of serialization libraries, and as far as I can tell this is a universal property.
The core question is this: is this actually not so? Are there save/restore solutions out there that can include thread state, in terms of where in its code a thread is executing?
Note that saving an entire system state in a virtual machine does not count: that is not really serializing the state, but just freezing a machine and moving it. It is an obvious solution, but a bit heavyweight most of the time.
Some questions made it clear that I had not explained well enough how we do things. We are working on a simulator system, with very strict rules for how code running inside it may be written. In particular, we make a complete divide between object construction and object state. The interface function pointers are recreated every time you set up the system, and are not part of the state. The state consists only of specifically designated "attributes", each with a defined get/set function that converts between the internal runtime representation and the storage representation. Pointers between objects are all converted to names. So in our design, an object might come out like this in storage:
Object foo {
value1: 0xff00ff00;
value2: 0x00ffeedd;
next_guy_in_chain: bar;
}
Object bar {
next_guy_in_chain: null;
}
Linked lists are never really present in the simulation structure; each object represents a unit of hardware of some kind.
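As a rough illustration of the attribute idea, here is a minimal C sketch that writes one object in the storage format shown above and resolves a stored name back to a live object. The names `sim_object`, `sim_object_export`, and `sim_lookup` are hypothetical and not part of the simulator described here, and for simplicity every object carries the same three attributes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical simulated unit: two value attributes and one object reference. */
typedef struct sim_object {
    const char *name;                      /* stable name, stored instead of a pointer */
    uint32_t value1;
    uint32_t value2;
    struct sim_object *next_guy_in_chain;  /* runtime representation: a raw pointer */
} sim_object;

/* Write the explicit state of obj into buf.
   Returns the number of characters written, or -1 if buf is too small. */
int sim_object_export(const sim_object *obj, char *buf, size_t buflen)
{
    assert(obj != NULL && buf != NULL);
    /* The pointer is converted to the name of the object it refers to. */
    const char *next = obj->next_guy_in_chain ? obj->next_guy_in_chain->name : "null";
    int n = snprintf(buf, buflen,
                     "Object %s {\nvalue1: 0x%08x;\nvalue2: 0x%08x;\n"
                     "next_guy_in_chain: %s;\n}\n",
                     obj->name, (unsigned)obj->value1, (unsigned)obj->value2, next);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Resolve a stored name back to a live object.
   Returns NULL for "null" or for a name that is not in the table. */
sim_object *sim_lookup(sim_object **objs, size_t count, const char *name)
{
    assert(objs != NULL && name != NULL);
    if (strcmp(name, "null") == 0)
        return NULL;
    for (size_t i = 0; i < count; i++)
        if (strcmp(objs[i]->name, name) == 0)
            return objs[i];
    return NULL;
}
```

On restore, all objects would first be constructed (which recreates their function pointers), and only then would names be resolved, so that forward references between objects work.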
The problem is that some people want to do this, but also have threads as a way to code behavior. "Behavior" here is really mutation of the state of the simulation units. Basically, the design we have says that all such changes have to be made in atomic complete operations that are called, do their work, and return. All state is stored in the objects. You have a reactive model, or it could be called "run to completion", or "event driven".
The other way of thinking about this is to have objects have active threads working on them, which sit in an eternal loop in the same way as classic Unix threads, and never terminate. This is the case that I am trying to see if it can be reasonably stored to disk, but it does not seem like that is feasible without interposing a VM underneath.
Update, October 2009: A paper related to this was published at the FDL conference in 2009, see this paper about checkpointing and SystemC.
Comments (7)
I don't think serializing only "some threads" of a program can work, since you will run into problems with synchronization (some of the problems are described here http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html ).
So persisting your whole program is the only viable way to get a consistent state.
What you might look into is orthogonal persistence. There are some prototype implementations:
http://research.sun.com/forest/COM.Sun.Labs.Forest.doc.external_www.PJava.main.html
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.7429
But none of them is maintained anymore or has gained much traction (afaik). I guess checkpointing is not the best solution after all. In my own project http://www.siebengeisslein.org I am trying the approach of using lightweight transactions to dispatch an event, so thread state does not have to be maintained (at the end of a transaction the thread callstack is empty again, and if an operation is stopped in mid-transaction, everything is rolled back, so the thread callstack does not matter in that case either).
You probably can implement something similar with any OODBMS.
Another way to look at things is continuations (http://en.wikipedia.org/wiki/Continuation , http://jauvm.blogspot.com/). They are a way to suspend execution at defined code locations (but they do not necessarily persist the thread state).
I hope this gives you some starting points (but there is no ready-to-use solution to this afaik).
EDIT: After reading your clarifications: You should definitely look into OODBMS. Dispatch each event in its own transaction and don't care about threads.
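A toy C sketch of the "one event per transaction" idea follows. It only illustrates the principle with an in-memory snapshot; it is not how the project mentioned above or any particular OODBMS implements transactions, and `sim_state`, `dispatch_in_transaction`, and `fifo_pop` are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical explicit state of a simulated unit. */
typedef struct {
    int fifo_level;       /* number of entries currently queued */
    int events_applied;   /* bookkeeping updated by handlers */
} sim_state;

/* A handler returns true on success, false if the event must be aborted. */
typedef bool (*event_handler)(sim_state *s, int arg);

/* Run one event as an all-or-nothing step. When this returns, the handler's
   call stack is gone, so only *s needs to be persisted. */
bool dispatch_in_transaction(sim_state *s, event_handler h, int arg)
{
    assert(s != NULL && h != NULL);
    sim_state snapshot = *s;   /* begin: remember the state before the event */
    if (h(s, arg))
        return true;           /* commit: keep the mutated state */
    *s = snapshot;             /* roll back: discard partial changes */
    return false;
}

/* Example handler: remove count entries, failing if too few are queued. */
bool fifo_pop(sim_state *s, int count)
{
    s->events_applied++;       /* partial mutation made before the check */
    if (count > s->fifo_level)
        return false;
    s->fifo_level -= count;
    return true;
}
```

A checkpoint taken between two calls to `dispatch_in_transaction` never sees a half-applied event, which is the property that makes thread state irrelevant here.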
It really sounds like saving the state of a virtual machine and being able to restore it the exact same way is exactly what you want.
If all you need is to be able to start the program running with the same data that the previous execution was using, then you only need to save off and restore persistent data. The exact state of each thread shouldn't really matter, since it will change so fast anyway, and the actual addresses of things will be different the next time. Using a database should give you this ability anyway.
A better approach than trying to serialize program state would be to implement Crash Only Software with data checkpointing. How you do your data checkpointing will depend on your implementation and problem domain.
It looks like you want to have a closure in C++. As you have pointed out, there is no mechanism built into the language to let you do this. As far as I know this is basically impossible to do in a totally general manner. In general it's hard to do in a language that doesn't have a VM. You can fake it somewhat by doing something like you have suggested: basically creating a closure object that maintains the execution environment/state, then having it serialize itself when it is in a known state.
You will also run into trouble with your function pointers. The functions can be loaded to different memory addresses on each load.
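One common way around the function pointer problem is to never store the address at all, only a stable name, and to look the pointer up again on load. A minimal C sketch, with a made-up class table and made-up step functions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*step_fn)(int);

/* Two hypothetical behaviors; their addresses may differ from run to run. */
int uart_step(int x)  { return x + 1; }
int timer_step(int x) { return x * 2; }

/* Names are stable across runs, addresses are not. */
typedef struct {
    const char *class_name;
    step_fn step;
} class_entry;

const class_entry class_table[] = {
    { "uart",  uart_step  },
    { "timer", timer_step },
};

#define CLASS_COUNT (sizeof class_table / sizeof class_table[0])

/* Saving: translate a live function pointer into the name to write out. */
const char *class_name_of(step_fn fn)
{
    for (size_t i = 0; i < CLASS_COUNT; i++)
        if (class_table[i].step == fn)
            return class_table[i].class_name;
    return NULL;
}

/* Loading: translate a stored name back into this run's function pointer. */
step_fn step_for_class(const char *class_name)
{
    assert(class_name != NULL);
    for (size_t i = 0; i < CLASS_COUNT; i++)
        if (strcmp(class_table[i].class_name, class_name) == 0)
            return class_table[i].step;
    return NULL;
}
```

This matches the approach described in the question, where interface function pointers are recreated at setup and are not part of the saved state.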
I consider the thread state to be an implementation detail which is probably not appropriate to be serialized. You want to save the state of your objects--not necessarily how they got to be the way they are.
As an example of why you want to take this approach, consider hitless upgrade. If you're running version N of your application and want to upgrade to version N+1, you can do so using object serialization. However, the "version N+1" threads are going to be different from the version N threads.
You should NOT try to serialize a state that your program has to disk, because your program will never have full control over its state unless it is allowed to by the operating system, in which case... it is part of the operating system.
You cannot guarantee that a pointer to some virtual memory location will point to the same virtual memory location again (except for properties like heap-begin/end, stack-begin), because to the program the operating system's choices for virtual memory are nondeterministic. The pages you request from the OS via sbrk or the higher level interfaces such as malloc will begin anywhere.
Better:
I suspect you want to shortcut the development time it takes to serialize/deserialize specific data structures, such as linked lists. Be assured, what you are attempting to do is not trivial and it's a lot more work. If you insist on doing so, consider looking into your operating system's memory management code and into the OS's paging mechanisms. ;-)
EDIT due to appended question: The design you state sounds like some kind of state machine; object properties are set up such that they are serializable, function pointers can be restored.
First, regarding thread states in objects: these only matter if there can be typical concurrent-programming problems such as race conditions, etc. If that's the case, you need thread-synchronization functionality, such as mutexes, semaphores, etc. Then you can at any time access the properties to serialize/deserialize and be safe.
Second, regarding object setup: looks cool, not sure if you are using a binary or some other object representation. Assuming binary: you can serialize them easily if you can describe the actual structures in memory (which is a bit of coding overhead). Insert some kind of class-ID value at the beginning of the objects and have a lookup table that points to the actual layout. Look at the first sizeof(id) bytes and you know which kind of struct is lying there.
When serializing/deserializing, approach the problem like this: you can look up the length of the hypothetically packed (no spacing between members) structure, allocate that size and read/write the members one after the other. Think offsetof or, if your compiler supports it, just use packed structs.
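The class-ID and offsetof idea could look roughly like this in C. The struct `unit_a`, the descriptor table, and the function names are hypothetical, and the bytes are written in host byte order, so this sketch is only portable between runs on the same kind of machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical object: the class ID comes first, as suggested above. */
typedef struct {
    uint32_t class_id;
    uint8_t  flags;     /* the compiler may insert padding after this member */
    uint32_t counter;
    uint16_t level;
} unit_a;

#define UNIT_A_ID 1u

/* One entry per member: where it lives in memory and how big it is. */
typedef struct { size_t offset; size_t size; } field_desc;

#define FIELD(type, member) { offsetof(type, member), sizeof(((type *)0)->member) }

const field_desc unit_a_fields[] = {
    FIELD(unit_a, class_id), FIELD(unit_a, flags),
    FIELD(unit_a, counter),  FIELD(unit_a, level),
};

/* Lookup table from class ID to layout; only one class in this sketch. */
const field_desc *fields_for_class(uint32_t class_id, size_t *count)
{
    assert(count != NULL);
    if (class_id == UNIT_A_ID) {
        *count = sizeof unit_a_fields / sizeof unit_a_fields[0];
        return unit_a_fields;
    }
    *count = 0;
    return NULL;
}

/* Length of the hypothetically packed structure (no padding between members). */
size_t packed_size(const field_desc *f, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += f[i].size;
    return total;
}

/* Copy the members one after the other into out, skipping any padding. */
void pack(const void *obj, const field_desc *f, size_t n, uint8_t *out)
{
    assert(obj != NULL && f != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        memcpy(out, (const uint8_t *)obj + f[i].offset, f[i].size);
        out += f[i].size;
    }
}

/* Inverse of pack: copy each member from in back to its in-memory offset. */
void unpack(void *obj, const field_desc *f, size_t n, const uint8_t *in)
{
    assert(obj != NULL && f != NULL && in != NULL);
    for (size_t i = 0; i < n; i++) {
        memcpy((uint8_t *)obj + f[i].offset, in, f[i].size);
        in += f[i].size;
    }
}
```

When reading, the first `sizeof(uint32_t)` bytes of the packed data give the class ID, which selects the descriptor table to pass to `unpack`.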
EDIT due to bold core question :-) No, there are none; not for C.
Something like this was actually proposed for Java in JSR 323:
http://tech.puredanger.com/2008/01/09/strong-mobility-for-java/
but was not accepted as being too theoretical:
http://tech.puredanger.com/2008/01/24/jcp-votes-down-jsr-323/
If you follow the links, you can find some interesting research on this problem.