Disk-persisted-lazy-cacheable-List™ in Scala

Published on 2024-12-29 19:48:04


I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on a disk).

  • All update operations are cons (head appends).
  • All read accesses start at the head and traverse the list in order until a predetermined pair is found.
  • A cache would be great, since most read accesses hit the same data over and over.

So, this is basically a "disk-persisted-lazy-cacheable-List" ™

Any ideas on how to get one before I start rolling my own?


Addendum: yes, mongodb, or any other non-embeddable resource, is overkill. If you are interested in a specific use-case for this, see the class Timeline here. Basically, I wish to have a very, very big timeline (millions of pairs spanning months), although my matches only need to touch the last few hours.

Comments (4)

清眉祭 2025-01-05 19:48:04


The easiest way to do something like this is to extend Traversable. You only have to define foreach, and you have full control over the traversal, so you can do things like open and close the file.

You can also extend Iterable, which requires defining iterator and, of course, returning some sort of Iterator. In this case, you'd probably create an Iterator for the disk data, but it's going to be much harder to control things like open files (see the Iterator sketch after the example below).

Here's an example of the kind of Traversable I described, written by Josh Suereth:

class FileLinesTraversable(file: java.io.File) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit = {
    val in = new java.io.BufferedReader(new java.io.FileReader(file))
    try {
      // Apply f to each line until end of file (readLine returns null).
      def loop(): Unit = in.readLine match {
        case null => ()
        case line => f(line); loop()
      }
      loop()
    } finally {
      // The traversal owns the file handle: it's closed even if f throws.
      in.close()
    }
  }
}
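
For contrast, here's a minimal sketch of the Iterator-based variant (my own illustration, not part of the original answer). It shows the control problem mentioned above: if the caller abandons the iterator early, there's no natural place to close the reader.

class FileLinesIterable(file: java.io.File) extends Iterable[String] {
  override def iterator: Iterator[String] = new Iterator[String] {
    private val in = new java.io.BufferedReader(new java.io.FileReader(file))
    private var nextLine = in.readLine()

    def hasNext: Boolean = nextLine != null

    def next(): String = {
      if (nextLine == null) throw new NoSuchElementException("end of file")
      val line = nextLine
      nextLine = in.readLine()
      if (nextLine == null) in.close() // only closed if iteration runs to the end
      line
    }
  }
}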
淡淡の花香 2025-01-05 19:48:04


You write:

mongodb, or any other non-embeddable resource, is overkill

Do you know that there are embeddable database engines, including some really small ones? If you do, I'm not sure what your exact requirements are, or why you would not use one of them.

Are you sure that Hibernate plus an embeddable DB (say, SQLite) would not be enough?
Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.

If you do not perform queries on the objects themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried it, and I don't know which would be faster. But serialization is probably the only way to be completely generic in the type, assuming that your framework of choice offers a suitable interface to write [T <: Serializable]. If not, you could write [T: MySerializable] after creating your own "type-class" MySerializable[T] (like, for instance, Ordering[T] in the Scala standard library).
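
To make the type-class suggestion concrete, here is a minimal sketch under my own assumptions (the trait name MySerializable, its methods, and the (Long, Long) instance are all illustrative):

import java.io.{DataInputStream, DataOutputStream}

// Evidence that T can be written to and read back from a binary stream.
trait MySerializable[T] {
  def write(out: DataOutputStream, value: T): Unit
  def read(in: DataInputStream): T
}

object MySerializable {
  // An instance for (Long, Long) pairs, e.g. (timestamp, value).
  implicit val longPair: MySerializable[(Long, Long)] =
    new MySerializable[(Long, Long)] {
      def write(out: DataOutputStream, v: (Long, Long)): Unit = {
        out.writeLong(v._1); out.writeLong(v._2)
      }
      def read(in: DataInputStream): (Long, Long) =
        (in.readLong(), in.readLong())
    }
}

object TimelineIO {
  // The context bound [T: MySerializable] pulls the instance in implicitly.
  def append[T: MySerializable](out: DataOutputStream, value: T): Unit =
    implicitly[MySerializable[T]].write(out, value)
}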

However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds like a bad requirement, because it suggests using serialization for this; I guess you can relax it to "anything serializable with my framework of choice". Java serialization is extremely inefficient in time and space, and it is not designed to serialize a single object: instead, it gives you back a file complete with special headers. I would suggest using a different serialization framework; have a look here for a comparison.

Additional reasons not to go down the road of a custom implementation

In addition, it sounds like you would be reading the file essentially backward, and that's quite a bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.

Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.

对你的占有欲 2025-01-05 19:48:04


If you don't want to step up to one of the embeddable DBs, how about a stack in a memory-mapped file?

  • A stack seems to meet your desired access characteristics. (Push a bunch of data, and iterate over the most recently pushed data frequently.)
  • You can use Java's MappedByteBuffer directly from Scala. You get to address the file as if it were memory, without actually loading the file into memory (see the sketch after this list).
  • You'd get some caching for free from the OS this way, since the mapped file would function like virtual memory. Recently written/accessed pages would stay in the OS's file cache until the OS saw fit to flush them (or you flushed them manually) back to disk.
  • You could build your stack from either end of the file if you're worried about sequential read performance, but if you're usually reading data you just wrote, I wouldn't expect that to be a problem, since it will still be in memory. (Though if you're reading data written over hours/days, across pages, it might be a problem.)
  • A file addressed this way is limited in size to 2GB even on a 64-bit JVM, but you can use multiple files to overcome this limitation.
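
A minimal sketch of that idea in Scala (all names here are illustrative, the two-Longs-per-pair layout is an assumption, and persistence of the stack pointer across restarts is left out): a fixed-capacity stack of (Long, Long) pairs backed by a memory-mapped file, iterated newest-first the way the question's reads go.

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

class MappedPairStack(path: String, maxPairs: Int) {
  private val PairBytes = 16 // two Longs per (X, Y) pair
  private val file = new RandomAccessFile(path, "rw")
  // Map the whole region up front; it must stay under the 2GB limit noted above.
  private val buf = file.getChannel.map(
    FileChannel.MapMode.READ_WRITE, 0, maxPairs.toLong * PairBytes)

  private var top = 0 // number of pairs currently on the stack

  def push(x: Long, y: Long): Unit = {
    buf.putLong(top * PairBytes, x)
    buf.putLong(top * PairBytes + 8, y)
    top += 1
  }

  // Newest-first traversal: recently pushed pages are likely still in the OS cache.
  def fromTop: Iterator[(Long, Long)] =
    Iterator.range(top - 1, -1, -1).map { i =>
      (buf.getLong(i * PairBytes), buf.getLong(i * PairBytes + 8))
    }
}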
权谋诡计 2025-01-05 19:48:04


These Java libraries may contain what you need. They aim to store entries more efficiently than standard Java collections.

  • github.com/OpenHFT/Chronicle-Queue
  • github.com/OpenHFT/Chronicle-Map
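
A hedged usage sketch for Chronicle Queue in Scala (treat the exact builder/appender names as assumptions to verify against the Chronicle Queue version you use; here each (X, Y) pair is encoded as a text line):

import net.openhft.chronicle.queue.ChronicleQueue

object ChronicleSketch extends App {
  val queue = ChronicleQueue.singleBuilder("timeline-queue").build()
  try {
    val appender = queue.acquireAppender()
    appender.writeText("12345,67890") // one (X, Y) pair as text

    // Tail the queue from the start; readText returns null when exhausted.
    val tailer = queue.createTailer()
    Iterator.continually(tailer.readText()).takeWhile(_ != null).foreach(println)
  } finally queue.close()
}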