Is Bitcask suitable for simple, high-performance file storage?

Posted on 2024-11-07 04:27:02

I am looking for a simple way to store and retrieve millions of XML files. Currently everything is done in the filesystem, which has some performance issues.

Our requirements are:

  1. Ability to store millions of XML files in a batch process. XML files may be up to a few MB in size; most are in the 100 KB range.
  2. Very fast random lookup by id (e.g. document URL)
  3. Accessible from both Java and Perl
  4. Available on the most important Linux distributions and Windows

I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:

  1. No clustering required
  2. No daemon ("service") required
  3. No clever search functionality required

Having delved deeper into Riak, I have found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no means to access a Bitcask repo via Java (or is there?).

So my question boils down to:

  • Is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents?
  • Are there any viable alternatives to Bitcask available via Java? (Berkeley DB comes to mind...)
  • (For Riak specialists) Does Riak add much overhead in terms of implementation, management and resources compared to "naked" Bitcask?
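For readers unfamiliar with the model named in the first question, here is a minimal Java sketch of "append-only writes, in-memory key management". It is an illustration only: the class `TinyAppendStore` and its methods are hypothetical names, not a Bitcask API, and the on-disk layout is deliberately simplified (values only, so the index cannot be rebuilt after a restart; no checksums, no multiple data files, no merging).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the Bitcask idea; not Bitcask's real file format or API. */
class TinyAppendStore implements AutoCloseable {
    // In-memory index ("keydir"): key -> {offset of the value, length of the value}.
    private final Map<String, long[]> keydir = new HashMap<>();
    private final RandomAccessFile data;

    TinyAppendStore(String path) throws IOException {
        this.data = new RandomAccessFile(path, "rw");
    }

    /** Appends the value at the end of the file and points the key at it. */
    void put(String key, byte[] value) throws IOException {
        long offset = data.length();
        data.seek(offset);
        data.write(value);
        // An overwrite only replaces the index entry; the old bytes stay in the file as dead space.
        keydir.put(key, new long[] {offset, value.length});
    }

    /** One index lookup in memory, then one positioned read from the file. */
    byte[] get(String key) throws IOException {
        long[] location = keydir.get(key);
        if (location == null) {
            return null;
        }
        byte[] buffer = new byte[(int) location[1]];
        data.seek(location[0]);
        data.readFully(buffer);
        return buffer;
    }

    /** Total bytes on disk, including values that have been overwritten. */
    long fileSize() throws IOException {
        return data.length();
    }

    @Override
    public void close() throws IOException {
        data.close();
    }
}
```

Calling `put("doc-1", bytes)` twice with the same key leaves both values in the file, but `get("doc-1")` returns only the newer one. That leftover space is what a merge would later reclaim.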

Comments (2)

想你的星星会说话 2024-11-14 04:27:03

I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.

The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.


Note that the above assumes that the XML documents are updated relatively frequently. If updates are rare and/or you can cope with a significant amount of "wasted" space, then merging may only need to be done rarely, or not at all.
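To put a rough number on the copying concern, the back-of-the-envelope sketch below multiplies document count by average size. The class `MergeEstimate`, the figure of two million documents, and the 100 MB/s throughput are all assumptions chosen for illustration; they are not measurements of Bitcask.

```java
/** Hypothetical helper for estimating how much data a full merge would rewrite. */
class MergeEstimate {
    /** A merge copies every live value once, so the volume is roughly count x average size. */
    static long bytesCopied(long liveDocuments, long averageValueBytes) {
        return liveDocuments * averageValueBytes;
    }

    /** Time needed at an assumed sustained copy throughput, in seconds. */
    static double secondsAt(long bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond;
    }
}
```

Under those assumptions, `MergeEstimate.bytesCopied(2_000_000L, 100_000L)` is 200,000,000,000 bytes (200 GB), and `MergeEstimate.secondsAt(...)` at 100 MB/s gives 2,000 seconds, a bit over half an hour per full merge. Whether that is acceptable depends on how often a merge is actually needed.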

向地狱狂奔 2024-11-14 04:27:03

Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
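The point about wasted space can be made concrete with a small bookkeeping sketch: dead space only grows when a key is written again. The class `DeadSpaceTracker` and the idea of a single threshold are hypothetical simplifications for illustration, not how Bitcask itself decides when to merge.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker for how much of an append-only log is dead (overwritten) data. */
class DeadSpaceTracker {
    private final Map<String, Long> liveSizes = new HashMap<>();
    private long totalBytes = 0; // everything ever appended
    private long liveBytes = 0;  // only the newest value per key

    void recordWrite(String key, long valueBytes) {
        totalBytes += valueBytes;
        Long previous = liveSizes.put(key, valueBytes);
        if (previous != null) {
            liveBytes -= previous; // the older value for this key is now dead space
        }
        liveBytes += valueBytes;
    }

    /** Fraction of the log occupied by overwritten values, between 0.0 and 1.0. */
    double deadFraction() {
        return totalBytes == 0 ? 0.0 : (double) (totalBytes - liveBytes) / totalBytes;
    }

    /** The threshold is an assumed tuning knob chosen by the caller. */
    boolean shouldMerge(double threshold) {
        return deadFraction() >= threshold;
    }
}
```

With a write-once batch load, every key is recorded exactly once, `deadFraction()` stays at 0.0, and `shouldMerge` never fires for any positive threshold.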

Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.

I am not sure about the status of a Java version/wrapper.
