磁盘持久惰性缓存列表 ™在斯卡拉
我需要在 Scala 中有一个非常非常长的对 (X, Y) 列表。太大了,无法放入内存(但非常适合磁盘)。
- 所有更新操作都是缺点(头追加)。
- 所有读访问都从头部开始,并有序遍历列表,直到找到预先确定的对。
- 缓存会很棒,因为大多数读取访问都会一遍又一遍地保留相同的数据。
所以,这基本上是一个“磁盘持久延迟可缓存列表”™
在我开始推出自己的列表之前,有什么关于如何获取它的想法吗?
附录:是的.. mongodb,或任何其他不可嵌入的资源,都是一种矫枉过正。如果您对此的特定用例感兴趣,请参阅 此处Timeline 类。基本上,我有一个非常非常大的时间表(几个月内有数百万对),尽管我的比赛只需要触及最后几个小时。
I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on a disk).
- All update operations are cons (head appends).
- All read accesses start in the head, and orderly traverses the list until it finds a pre-determined pair.
- A cache would be great, since most read accesses will keep the same data over and over.
So, this is basically a "disk-persisted-lazy-cacheable-List" ™
Any ideas on how to get one before I start to roll out my own?
Addendum: yes.. mongodb, or any other non-embeddable resource, is an overkill. If you are interested in a specific use-case for this, see the class Timeline
here. Basically, I which to have a very, very big timeline (millions of pairs throughout months), although my matches only need to touch the last hours.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
执行此类操作的最简单方法是扩展
Traversable
。您只需要定义foreach
,并且可以完全控制遍历,因此您可以执行打开和关闭文件之类的操作。您还可以扩展
Iterable
,这需要定义iterator
,当然还需要返回某种Iterator
。在这种情况下,您可能会为磁盘数据创建一个迭代器,但控制打开文件等操作会困难得多。下面是我所描述的
Traversable
的一个示例,由 Josh Suereth 编写:The easiest way to do something like this is to extend
Traversable
. You only have to defineforeach
, and you have full control over the traversal, so you can do things like open and close the file.You can also extend
Iterable
, which requires definingiterator
and, of course, returning some sort ofIterator
. In this case, you'd probably create anIterator
for the disk data, but it's going to be much harder to control things like open files.Here's one example of a
Traversable
such as I described, written by Josh Suereth:你写:
您知道有嵌入式数据库引擎,包括一些非常小的引擎吗?如果您知道,我不确定您的确切要求以及您为什么不使用它们。
您确定 Hibernate + 可嵌入的数据库(例如 SQLite)还不够吗?
或者,BerkeleyDB Java 版、HSQLDB 或 其他嵌入式数据库 可能是一个选项。
如果您不对对象本身执行查询(听起来确实没有),也许序列化会比复杂对象的对象关系映射更简单,但我从未尝试过,而且我不知道哪一个会更简单快一点。但序列化可能是类型中完全通用的唯一方法,假设您选择的框架提供了合适的接口来编写
[T <: Serialized]
。如果没有,您可以在创建自己的“类型类”MySerialized[T]
(例如Ordering[T]< /code> 在 Scala 标准库中)。
但是,您不想使用标准 Java 序列化来完成此任务。 “任何可序列化的内容”听起来是一个糟糕的要求,因为它建议为此使用序列化,但我想您可以将其放宽为“使用我选择的框架可序列化的任何内容”。序列化在时间和空间上效率极低,并且不是为了序列化单个对象而设计的,而是返回一个包含特殊标头的文件。我建议使用一些不同的序列化框架 - 请查看此处进行比较。
不走自定义实现之路的其他原因
此外,听起来您基本上会向后读取文件,从性能角度来看,在非 SSD 上,这是一种非常糟糕的访问模式磁盘:读取一个扇区后,需要几乎完整的磁盘旋转才能访问前一个扇区。
此外,正如 Chris Shain 在上面的评论中指出的那样,您需要使用基于页面的解决方案,并且需要处理可变大小的对象。
You write:
Do you know that there are embeddable database engines, including some really small ones? If you know, I'm not sure about your exact requirement and why would you not use them.
You sure that Hibernate + an embeddable DB (say SQLite) would not be enough?
Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.
If you do not perform queries on the object themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried, and I don't know which would be faster. But serialization is probably the only way to be completely generic in the type, assuming that your framework of choice offers a suitable interface to write
[T <: Serializable]
. If not, you could write[T: MySerializable]
after creating your own "type-class"MySerializable[T]
(like for instanceOrdering[T]
in the Scala standard library).However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds a bad requirement because it suggests the use of serialization for this, but I guess you can relax that to "anything serializable with my framework of choice". Serialization is extremely inefficient in time and space and is not designed to serialize a single object, instead it gives you back a file complete with special headers. I would suggest to use some different serialization framework - have a look here for a comparison.
Additional reasons not to go on the road of a custom implementation
In addition, it sounds like you would be reading the file essentially backward, and that's a quite bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.
Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.
如果您不想升级到可嵌入的数据库之一,那么在 内存映射中使用堆栈怎么样?文件?
If you don't want to step up to one of the embeddable DBs, how about a stack in memory mapped files?
这些 Java 库可能包含您需要的内容。他们的目标是比标准 Java 集合更有效地存储条目。
These Java libraries may contain what you need. They aim to store entries more efficiently than standard Java collections.