Data structure for storing arbitrary database tables
I'd like to design a JVM data structure (Java/Scala) that can be used to represent and store the contents of arbitrary relational database tables. The data structure should be fast (not too GC-intensive, cache-friendly) and memory-efficient, so that larger tables can fit in RAM.
One memory-efficient solution is to store each column separately in a primitive array, but I'm worried about cache friendliness because items in the same row are not stored together. Reading a full row with N columns touches N different arrays and so can incur up to N cache misses, no matter how narrow the columns are.
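To make the column-per-array layout concrete, here is a minimal sketch. The class `ColumnTable` and its two columns (`id`, `price`) are hypothetical names chosen for illustration; a real implementation for arbitrary tables would need one array per column chosen from the table's schema.

```java
import java.util.Arrays;

// Hypothetical two-column table: each column lives in its own primitive array.
final class ColumnTable {
    private int[] ids = new int[16];
    private double[] prices = new double[16];
    private int size = 0;

    // Appends a row and returns its row index; arrays grow by doubling.
    int append(int id, double price) {
        if (size == ids.length) {
            ids = Arrays.copyOf(ids, size * 2);
            prices = Arrays.copyOf(prices, size * 2);
        }
        ids[size] = id;
        prices[size] = price;
        return size++;
    }

    int size() { return size; }

    // Reading one row touches both arrays (one potential cache miss per column).
    int id(int row) { check(row); return ids[row]; }
    double price(int row) { check(row); return prices[row]; }

    // Scanning a single column walks one contiguous array.
    double sumPrices() {
        double sum = 0;
        for (int i = 0; i < size; i++) sum += prices[i];
        return sum;
    }

    private void check(int row) {
        if (row < 0 || row >= size) throw new IndexOutOfBoundsException("row " + row);
    }

    public static void main(String[] args) {
        ColumnTable t = new ColumnTable();
        t.append(1, 9.99);
        t.append(2, 0.01);
        System.out.println(t.id(1) + " " + t.price(1) + " " + t.sumPrices());
    }
}
```

The trade-off shows up in the two access paths: `sumPrices` reads memory sequentially, while fetching a whole row jumps between arrays. Which pattern dominates depends on the workload.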
Another solution is to store each row in an object array where each element represents a field and is cast to the correct type on retrieval, but this requires storing numeric types in their boxed form, so it's not very memory-efficient. And it's probably not that cache efficient either.
Another solution is to lay out each row's data in a byte array the same way real databases serialize their rows, using only as many bytes as necessary. This is cache-friendly and memory-efficient, but I'm concerned about the cost of serialization/deserialization on every access.
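For fixed-width columns, the packed-row idea can be sketched with `java.nio.ByteBuffer`, whose absolute `getInt`/`getLong`/`getDouble` methods read a primitive at a byte offset without building intermediate objects. The class `PackedRows`, its schema (an `int` id, a `long` timestamp, a `double` price) and the offsets below are assumptions made for illustration, and no performance measurement is implied.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width row store: all rows packed back to back in one buffer.
final class PackedRows {
    // Assumed schema: int id (4 bytes), long ts (8 bytes), double price (8 bytes).
    private static final int ID_OFFSET = 0;
    private static final int TS_OFFSET = 4;
    private static final int PRICE_OFFSET = 12;
    private static final int ROW_WIDTH = 20;

    private final ByteBuffer buf;
    private int rows = 0;

    PackedRows(int capacityRows) {
        buf = ByteBuffer.allocate(capacityRows * ROW_WIDTH);
    }

    // Writes one row at the next free slot and returns its row index.
    int append(int id, long ts, double price) {
        if ((rows + 1) * ROW_WIDTH > buf.capacity()) {
            throw new IllegalStateException("table full");
        }
        int base = rows * ROW_WIDTH;
        buf.putInt(base + ID_OFFSET, id);
        buf.putLong(base + TS_OFFSET, ts);
        buf.putDouble(base + PRICE_OFFSET, price);
        return rows++;
    }

    // Field reads are absolute gets at (row start + column offset).
    int id(int row) { return buf.getInt(base(row) + ID_OFFSET); }
    long ts(int row) { return buf.getLong(base(row) + TS_OFFSET); }
    double price(int row) { return buf.getDouble(base(row) + PRICE_OFFSET); }

    private int base(int row) {
        if (row < 0 || row >= rows) throw new IndexOutOfBoundsException("row " + row);
        return row * ROW_WIDTH;
    }

    public static void main(String[] args) {
        PackedRows t = new PackedRows(4);
        int r = t.append(7, 1000L, 2.5);
        System.out.println(t.id(r) + " " + t.ts(r) + " " + t.price(r));
    }
}
```

In this layout all fields of a row sit within 20 contiguous bytes, and reading a field is an offset computation plus one primitive read rather than a full deserialization of the row. Variable-length columns such as strings are not handled here.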
What's the best way?
Comments (3)
A fourth solution would be to store each row's data as strings instead of byte arrays. This may avoid serialization costs in most cases, provided that most of the data is strings.
This would also be easier to debug and would be platform-independent. Of course it has some limitations: e.g. a float cannot be represented as-is, but may be stored in something similar to a SQL DECIMAL format.
Any solution will be a trade-off.
EDIT: However, I would prefer the byte-array solution for your case: one byte array per row. This should be the most cache-friendly option for fixed-size rows. You would then also need a solution for variable-sized rows. A low-level language seems to fit that task better. In C one could define two formats: a fixed-size row format, where the table metadata contains the column offsets (e.g. column 1: bytes 0..31, column 2: bytes 32..127, etc.), and a variable-size row format, where the rows themselves contain the column sizes (e.g. bytes 0..3 contain the size, the following bytes contain that much data, then another 4 bytes contain the next size, followed by its data, and so on).
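The variable-size format can also be approximated on the JVM. The following is a minimal sketch, assuming every field is a UTF-8 string preceded by a 4-byte length; the class `VarRow` and its methods are hypothetical names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical length-prefixed row: [4-byte size][data][4-byte size][data]...
final class VarRow {
    // Packs the given fields into one byte array, one row per array.
    static byte[] encode(String... fields) {
        byte[][] raw = new byte[fields.length][];
        int total = 0;
        for (int i = 0; i < fields.length; i++) {
            raw[i] = fields[i].getBytes(StandardCharsets.UTF_8);
            total += 4 + raw[i].length;
        }
        ByteBuffer out = ByteBuffer.allocate(total);
        for (byte[] r : raw) {
            out.putInt(r.length); // size prefix
            out.put(r);           // field data
        }
        return out.array();
    }

    // Reads field `index` by skipping over the preceding length-prefixed fields.
    static String field(byte[] row, int index) {
        ByteBuffer in = ByteBuffer.wrap(row);
        for (int i = 0; i < index; i++) {
            int skip = in.getInt();
            in.position(in.position() + skip);
        }
        int len = in.getInt();
        return new String(row, in.position(), len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = encode("alice", "", "bob");
        System.out.println(field(row, 0) + "|" + field(row, 1) + "|" + field(row, 2));
    }
}
```

Note the cost model: reaching column k requires walking the k size prefixes before it, whereas the fixed-size format reaches any column through a precomputed offset in the table metadata.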
What is the purpose of doing this? You are likely better off simply storing the data that you retrieve from your database (as the objects you map it to) in some sort of caching layer like EhCache, OSCache, memcache, etc., rather than reinventing the wheel.
Why not use hsqldb or h2?
They both support in-memory mode and are pure Java. They force you to use SQL for access, but on the other hand, you don't have to implement your own joins.
Both are open source, so you can also use them as a performance baseline and see whether building your own column-wise or row-wise data structure would be faster and worth the effort.