Dynamically horizontally scalable key-value store

Posted 2024-08-18 12:18:07


Is there a key value store that will give me the following:

  • Allow me to simply add and remove nodes and will redistribute the data automatically
  • Allow me to remove nodes and still have 2 extra data nodes to provide redundancy
  • Allow me to store text or images up to 1GB in size
  • Can store small size data up to 100TB of data
  • Fast (so will allow queries to be performed on top of it)
  • Make all this transparent to the client
  • Works on Ubuntu/FreeBSD or Mac
  • Free or open source

I basically want something I can use as a "single" system, and not have to worry about having memcached, a db, and several storage components, so yes, I do want a database "silver bullet", you could say.

Thanks

Zubair

Answers so far:
MogileFS on top of BackBlaze - As far as I can see this is just a filesystem, and after some research it only seems to be appropriate for large image files

Tokyo Tyrant - Needs LightCloud. This doesn't auto-scale as you add new nodes. I did look into it, and it seems it is very fast for queries that fit onto a single node, though

Riak - This is one I am looking into myself, but I don't have any results yet

Amazon S3 - Is anyone using this as their sole persistence layer in production? From what I have seen it seems to be used for storage of images, as complex queries are too expensive

@shaman suggested Cassandra - definitely one I am looking into

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!


中二柚 2024-08-25 12:18:07


You are asking too much from open source software.

If you have a couple hundred thousand dollars in your budget for some enterprise class software, there are a couple of solutions. Nothing is going to do what you want out of the box, but there are companies that have products which are close to what you are looking for.

"Fast (so will allow queries to be performed on top of it)"

If you have a key-value store, everything should be very fast. However the problem becomes that without an ontology or data schema built on top of the key-value store, you will end up going through the whole database for each query. You need an index containing the key for each "type" of data you want to store.

In this case, you can usually perform queries in parallel against all ~15,000 machines. The bottleneck is that cheap hard drives cap out at 50 seeks per second. If your data set fits in RAM, your performance will be extremely high. However, if the keys are stored in RAM but there is not enough RAM for the values to be stored, the system will go to disc on almost all key-value lookups. The keys are each located at random positions on the drive.

This limits you to 50 key-value lookups per second per server. Whereas when the key-value pairs are stored in RAM, it is not unusual to get 100k operations per second per server on commodity hardware (ex. Redis).
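The arithmetic behind those two figures can be written down as a back-of-envelope model (the 50 seeks/s and 100k ops/s numbers are the answer's own estimates, not benchmarks):

```python
def lookup_throughput(n_servers, values_in_ram,
                      seeks_per_sec=50, ram_ops_per_sec=100_000):
    """Estimate aggregate key-value lookups/sec for a cluster.

    If values live on spinning disk, every lookup costs one random seek;
    if the key-value pairs fit in RAM, lookups run at in-memory speed.
    """
    per_server = ram_ops_per_sec if values_in_ram else seeks_per_sec
    return n_servers * per_server

# One commodity server: 50 disk-bound lookups/s vs 100k RAM-bound lookups/s.
disk_bound = lookup_throughput(1, values_in_ram=False)   # 50
ram_bound = lookup_throughput(1, values_in_ram=True)     # 100000
```

Even spread across thousands of machines, the disk-bound case stays orders of magnitude slower than a single RAM-bound server, which is the point the answer is making.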

Serial disc read performance is, however, extremely high. I have seen drives go to 50 MB/s (400 Mb/s) on serial reads. So if you are storing values on disc, you have to structure the storage so that the values that need to be read from disc can be read serially.

That is the problem. You cannot get good performance on a vanilla key-value store unless you either store the key-value pairs completely in RAM (or keys in RAM with values on SSD drives) or if you define some type of schema or type system on top of the keys and then cluster the data on disc so that all keys of a given type can be retrieved easily through a serial disc read.

If a key has multiple types (for example if you have data-type inheritance relationships in the database), then the key will be an element of multiple index tables. In this case, you will have to make time-space trade-offs to structure the values so that they can be read serially from disc. This entails storing redundant copies of the value for the key.

What you want is going to be a bit more advanced than a key-value store, especially if you intend to do queries. The problem of storing large files, however, is a non-problem. Pretend your system can handle values up to 50 MB per key. Then you just break up a 1 GB file into 50 MB segments and associate a key with each segment value. Using a simple server, it is straightforward to translate the part of the file you want into a key-value lookup operation.
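That chunking scheme can be sketched in a few lines (segment size shrunk to 4 bytes for illustration, and the `name:index` key format is an assumption, not any particular store's convention):

```python
SEGMENT_SIZE = 4  # 50 MB in the answer's scenario; tiny here for illustration

def store_blob(kv, name, data, segment_size=SEGMENT_SIZE):
    """Split a large value into fixed-size segments, one key per segment."""
    n = 0
    for off in range(0, len(data), segment_size):
        kv[f"{name}:{n}"] = data[off:off + segment_size]
        n += 1
    kv[f"{name}:segments"] = n            # record how many segments exist

def read_range(kv, name, start, length, segment_size=SEGMENT_SIZE):
    """Translate a byte range into the key-value lookups that cover it."""
    first = start // segment_size
    last = (start + length - 1) // segment_size
    out = b"".join(kv[f"{name}:{n}"] for n in range(first, last + 1))
    skip = start - first * segment_size   # offset within the first segment
    return out[skip:skip + length]

kv = {}
store_blob(kv, "movie", b"abcdefghij")
assert read_range(kv, "movie", 3, 4) == b"defg"
```

A real front-end server would do exactly this translation between HTTP range requests and segment-key lookups.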

The problem of achieving redundancy is more difficult. It's very easy to "fountain code" or "part file" the key-value table for a server, so that the server's data can be reconstructed at wire speed (1 Gb/s) onto a standby server if a particular server dies. Normally, you can detect server death using a "heartbeat" system, which is triggered if the server does not respond for 10 seconds. It is even possible to do key-value lookups against the part-file encoded key-value tables; doing so is inefficient, but it still gives you a backup for the event of server failure. A bigger issue is that it is almost impossible to keep the backup up to date, and the data may be 3 minutes old. If you are doing lots of writes, the backup functionality is going to introduce some performance overhead, but the overhead will be negligible if your system is primarily doing reads.
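The heartbeat scheme described above is simple to sketch; the 10-second timeout is the answer's figure, and the clock is injected here only so the logic can be exercised without waiting:

```python
import time

class HeartbeatMonitor:
    """Mark a server dead if it has not pinged within `timeout` seconds."""

    def __init__(self, timeout=10.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}

    def ping(self, server):
        """Called on every heartbeat received from a server."""
        self.last_seen[server] = self.clock()

    def dead_servers(self):
        """Servers silent for longer than the timeout: rebuild candidates."""
        now = self.clock()
        return [s for s, t in self.last_seen.items()
                if now - t > self.timeout]

# Simulated clock: node "b" falls silent and is flagged after 10 s.
t = [0.0]
mon = HeartbeatMonitor(timeout=10.0, clock=lambda: t[0])
mon.ping("a"); mon.ping("b")
t[0] = 5.0;  mon.ping("a")
t[0] = 12.0
assert mon.dead_servers() == ["b"]
```

On detection, the standby server would start reconstructing the dead node's key-value table from the part-file encoding.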

I am not an expert on maintaining database consistency and integrity constraints under failure modes, so I am not sure what problems this requirement would introduce. If you do not have to worry about this, it greatly simplifies the design of the system and its requirements.

Fast (so will allow queries to be performed on top of it)

First, forget about joins or any operation that scales faster than n*log(n) when your database is this large. There are two things you can do to replace the functionality normally implemented with joins. You can either structure the data so that you do not need to do joins or you can "pre-compile" the queries you are doing and make a time-space trade-off and pre-compute the joins and store them for lookup in advance.

For semantic web databases, I think we will be seeing people pre-compiling queries and making time-space trade-offs in order to achieve decent performance on even modestly sized datasets. I think that this can be done automatically and transparently by the database back-end, without any effort on the part of the application programmer. However, we are only starting to see enterprise databases implementing these techniques for relational databases. No open source product does it as far as I am aware, and I would be surprised if anyone is trying to do this for linked data in horizontally scalable databases yet.

For these types of systems, if you have extra RAM or storage space, the best use of it for performance reasons is to pre-compute and store the results of common sub-queries, instead of adding more redundancy to the key-value store. Pre-compute results and order them by the keys you are going to query against to turn an n^2 join into a log(n) lookup. Any query or sub-query that scales worse than n*log(n) needs to have its results pre-computed and cached in the key-value store.
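The "pre-compute the join" idea can be shown with a toy example. Plain Python dicts stand in for the key-value store here; a hash lookup in this sketch, where a sorted on-disc index would give the log(n) behaviour the answer describes:

```python
# Two "tables" stored as key-value data (names and shapes are illustrative).
users = {1: {"name": "ada"}, 2: {"name": "bob"}}
orders = [{"user_id": 1, "item": "disk"},
          {"user_id": 1, "item": "ram"},
          {"user_id": 2, "item": "cpu"}]

# Time-space trade-off: materialize the join once, clustered by the key
# we will query against, instead of scanning all orders per query.
orders_by_user = {}
for o in orders:
    orders_by_user.setdefault(o["user_id"], []).append(o["item"])

# Query time: one lookup instead of a scan over every order row.
assert orders_by_user[1] == ["disk", "ram"]
assert users[2]["name"] == "bob"
```

The cost is the redundant copy of the order data, which is exactly the trade-off the paragraph describes; the cache-invalidation problem below is what makes this hard under heavy writes.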

If you are doing a large number of writes, the cached sub-queries will be invalidated faster than they can be processed and there is no performance benefit. Dealing with cache invalidation for cached sub-queries is another intractable problem. I think a solution is possible, but I have not seen it.

Welcome to hell. You should not expect to get a system like this for free for another 20 years.

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!

You are asking for a miracle. Wait 20 years until we have open source miracle databases or you should be willing to pay money for a solution customized to your application's needs.

喵星人汪星人 2024-08-25 12:18:07


Amazon S3 is a storage solution, not a database.

If you only need simple key/value, your best bet would be to use Amazon SimpleDB in combination with S3. Large files are stored on S3, while metadata for searching is stored in SimpleDB. This gives you a horizontally scalable key/value system with direct access to S3.
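The split this answer describes can be sketched with in-memory stand-ins for the two services (plain dicts here; in production the two stores would be SimpleDB and S3 calls, which are not shown):

```python
blob_store = {}   # stand-in for S3: key -> large binary value
meta_store = {}   # stand-in for SimpleDB: key -> small searchable attributes

def put(key, data, **attributes):
    """Write the large payload to the blob store, the metadata alongside."""
    blob_store[key] = data
    meta_store[key] = {"size": len(data), **attributes}

def find(**criteria):
    """Search metadata only; blobs are fetched on demand afterwards."""
    return [k for k, attrs in meta_store.items()
            if all(attrs.get(a) == v for a, v in criteria.items())]

put("img-1", b"\x89PNG...", kind="image")
put("doc-1", b"hello world", kind="text")
assert find(kind="text") == ["doc-1"]
assert blob_store[find(kind="image")[0]].startswith(b"\x89PNG")
```

The pattern is general: queries never touch the blob store, so the searchable side stays small enough to index.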

凉风有信 2024-08-25 12:18:07


There's another solution, which seems to be exactly what you are looking for: The Apache Cassandra project: http://incubator.apache.org/cassandra/

At the moment Twitter is switching from a memcached+mysql cluster to Cassandra.

聽兲甴掵 2024-08-25 12:18:07


HBase and HDFS together fulfil most of these requirements. HBase can be used to store and retrieve small objects. HDFS can be used to store large objects. HBase compacts small objects and stores them as larger ones on HDFS. Speed is relative - HBase is not as fast on random reads from disk as mysql (for example) - but is pretty fast serving reads from memory (similar to Cassandra). It has excellent write performance. HDFS, the underlying storage layer, is fully resilient to loss of multiple nodes. It replicates across racks as well, allowing rack-level maintenance. It's a Java-based stack with an Apache license - it runs on pretty much any OS.

The main weaknesses of this stack are less than optimal random disk read performance and lack of cross data center support (which is a work in progress).

与君绝 2024-08-25 12:18:07


I can suggest you two possible solutions:

1) Buy Amazon's service (Amazon S3). For 100 TB it will cost you $14,512 monthly.
2) much cheaper solution:

Build two custom Backblaze storage pods (link) and run MogileFS on top of them.

Currently I'm investigating how to store petabytes of data using similar solutions, so if you find something interesting on that, please post your notes.

山色无中 2024-08-25 12:18:07


Take a look at Tokyo Tyrant. It is a very lightweight, high-performance, replicating daemon exporting a Tokyo Cabinet key-value store to the network. I've heard good things about it.

∝单色的世界 2024-08-25 12:18:07


From what I see in your question Project Voldemort seems to be the closest one. Have a look at their Design page.

The only problem I see is how it will handle huge files, and according to this thread, things aren't all good. But you can always work around that fairly easily using files. In the end - this is the exact purpose of a file system. Have a look at the wikipedia list of file systems - the list is huge.

何以笙箫默 2024-08-25 12:18:07


You might want to take a look at MongoDB.

From what I can tell you're looking for a database/distributed filesystem mix, which might be hard or even impossible to find.

You might want to take a look at distributed filesystems like MooseFS or Gluster and keep your data as files. Both systems are fault-tolerant and distributed (you can put in and take out nodes as you like), and both are transparent to clients (built on top of FUSE) - you're using simple filesystem ops. This covers the following features: 1), 2), 3), 4), 6), 7), 8). We're using MooseFS for digital movie storage with something around 1.5 PB of storage, and upload/download is as fast as the network setup allows (so performance is I/O dependent, not protocol or implementation dependent). You won't have queries (feature 5 on your list), but you can couple such a filesystem with something like MongoDB or even a search engine like Lucene (it has clustered indexes) to query data stored in the filesystem.
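The "keep your data as files" approach amounts to a thin key-value layer over ordinary filesystem ops, which work identically on a local disk or a FUSE-mounted MooseFS/Gluster volume. A minimal sketch (the hash fan-out into subdirectories is an assumption, used only to keep any single directory small):

```python
import hashlib
import tempfile
from pathlib import Path

class FileKV:
    """Key-value store backed by plain filesystem operations, usable on
    any POSIX mount (local disk, MooseFS, Gluster, ...)."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, key):
        h = hashlib.sha256(key.encode()).hexdigest()
        return self.root / h[:2] / h[2:4] / h   # fan out to limit dir size

    def put(self, key, data):
        p = self._path(key)
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(data)

    def get(self, key):
        return self._path(key).read_bytes()

store = FileKV(tempfile.mkdtemp())
store.put("poster.jpg", b"\xff\xd8...")
assert store.get("poster.jpg") == b"\xff\xd8..."
```

Replication, node membership, and client transparency are then the filesystem's job, which is exactly the division of labour this answer proposes.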

不奢求什么 2024-08-25 12:18:07


Zubair,

I am working on a key-value store which so far is faster than anything else.

It does not (yet) use replication, missing your 2 first requirements, but this question inspired me - thanks for that!

no: Allow me to simply add and remove nodes and will redistribute the data automatically
no: Allow me to remove nodes and still have 2 extra data nodes to provide redundancy
ok: Allow me to store text or images up to 1GB in size (yes: unlimited)
ok: Can store small size data up to 100TB of data (yes: unlimited)
ok: Fast (so will allow queries to be performed on top of it) (yes: faster than Tokyo Cabinet's TC-FIXED array)
ok: Make all this transparent to the client (yes: integrated to web server)
ok: Works on Ubuntu/FreeBSD or Mac (yes: Linux)
ok: Free or open source (yes: freeware)

Besides single-thread performance superior to hash tables and B-trees, this KV store is the ONLY ONE I KNOW to be "WAIT-FREE" (not blocking, nor delaying any operation).

少跟Wǒ拽 2024-08-25 12:18:07


MarkLogic is going in this direction. Not at all free, though...

红焚 2024-08-25 12:18:07


In addition to what others have mentioned - you could take a look at OrientDB - http://code.google.com/p/orient/ a document and K/V store that looks very promising.

韵柒 2024-08-25 12:18:07


Check out BigCouch. It's CouchDB, but optimized for clusters (and all the big data problems clusters are appropriate for). BigCouch is getting merged into the CouchDB project as we speak, by the folks at Cloudant, many of whom are core committers to CouchDB.

Rundown of your requirements:

Allow me to simply add and remove nodes and will redistribute the data automatically

Allow me to remove nodes and still have 2 extra data nodes to provide redundancy

Yes. BigCouch uses Dynamo's concept of Quorum to set how many nodes keep how many copies of your data.

Allow me to store text or images up to 1GB in size

Yes. Just like CouchDB, you can stream blobs (such as files) of arbitrary size to the database.

Can store small size data up to 100TB of data

Yes. The team that built BigCouch did so because they were facing a system generating petabytes of data per second.

Fast (so will allow queries to be performed on top of it)

Yes. Queries are done by MapReduce in O(log n) time.

Make all this transparent to the client

Works on Ubuntu/FreeBSD or Mac

Free or open source

Yup! Open source under the Apache 2.0 license. The default install instructions are for a Debian system, like Ubuntu.
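The Dynamo-style quorum mentioned above works roughly like this: each key is kept on N replicas, a write succeeds once W replicas acknowledge it, a read consults R replicas, and choosing R + W > N guarantees the read set overlaps the write set. A sketch under those assumptions (in-memory dicts stand in for nodes; real systems place replicas with consistent hashing, not shown here):

```python
def quorum_ok(n, r, w):
    """R + W > N guarantees a read quorum overlaps every write quorum."""
    return r + w > n

def write(replicas, key, value, w):
    """Send to replicas in order; succeed after W acknowledgements."""
    acks = 0
    for rep in replicas:
        rep[key] = value
        acks += 1
        if acks >= w:
            break

def read(replicas, key, r):
    """Consult R replicas and return the first value found."""
    seen = [rep.get(key) for rep in replicas[:r]]
    return next(v for v in seen if v is not None)

replicas = [{}, {}, {}]                  # N = 3 nodes
assert quorum_ok(3, r=2, w=2)
write(replicas, "k", "v1", w=2)
assert read(replicas, "k", r=2) == "v1"
```

In BigCouch the N/R/W values are per-database settings; this sketch only shows why the overlap condition matters.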
