Distributed database with many lightly loaded nodes
I'm working on a hobby project involving a rather CPU-intensive calculation. The problem is embarrassingly parallel. This calculation will need to happen on a large number of nodes (say 1000-10000). Each node can do its work almost completely independently of the others. However, the entire system will need to answer queries from outside the system. Approximately 100000 such queries per second will have to be answered. To answer the queries, the system needs some state that is sometimes shared between two nodes. The nodes need at most 128MB RAM for their calculations.
Obviously, I probably won't be able to afford to actually build this system at the scale described above, but I'm still interested in the engineering challenge of it, and thought I'd set up a small number of nodes as a proof of concept.
I was thinking about using something like Cassandra or CouchDB to have scalable persistent state across all nodes. If I run a distributed database server on each node, each one would be very lightly loaded, but it would be very nice from an ops perspective to have all nodes be identical.
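To make the shape of the system concrete, here is a minimal stand-in sketch of a single node: a CPU-bound worker loop plus a tiny query path that is answered purely from shared state. The StateStore class, function names, and the dummy calculation are all made up for illustration; in the real system the store would wrap whichever distributed database gets chosen.

    import threading
    import time

    class StateStore:
        """Hypothetical stand-in for the distributed database client;
        in the real system this would wrap Cassandra, CouchDB, etc."""

        def __init__(self):
            self._data = {}
            self._lock = threading.Lock()

        def put(self, key, value):
            with self._lock:
                self._data[key] = value

        def get(self, key):
            with self._lock:
                return self._data.get(key)

    def expensive_calculation(node_id):
        # Placeholder for the CPU-intensive, embarrassingly parallel work.
        return sum(i * i for i in range(100_000)) + node_id

    def compute_node(node_id, store, iterations=3):
        # Each node works independently and publishes the small piece of
        # shared state that queries (and occasionally other nodes) need.
        for _ in range(iterations):
            store.put(node_id, expensive_calculation(node_id))
            time.sleep(0.01)

    def handle_query(store, node_id):
        # External query path (~100k/s across the cluster): answered from
        # shared state only, never by redoing the calculation.
        return store.get(node_id)

    if __name__ == "__main__":
        store = StateStore()
        workers = [threading.Thread(target=compute_node, args=(i, store)) for i in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(handle_query(store, 2))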
Now to my question:
Can anyone suggest a distributed database implementation that would be a good fit for a cluster of a large number of nodes, each with very little RAM?
Cassandra seems to do what I want, but http://wiki.apache.org/cassandra/CassandraHardware recommends at least 4 GB of RAM per node.
I haven't found a figure for the memory requirements of CouchDB, but given that it is implemented in Erlang, I figure maybe it isn't so bad?
Anyway, recommendations, hints, suggestions, and opinions are welcome!
3 Answers
You should be able to do this with Cassandra, though depending on your reliability requirements, an in-memory database like Redis might be more appropriate.
Since the data set is so small (100 MB of data), you should be able to run with less than 4 GB of RAM per node. Adding in Cassandra overhead, you probably need 200 MB of RAM for the memtable and another 200 MB for the row cache (to cache the entire data set, turn off the key cache), plus another 500 MB for Java in general, which means you could get away with 2 GB of RAM per machine.
Using a replication factor of three, you probably only need a cluster on the order of tens of nodes to serve the number of reads/writes you require (especially since your data set is so small and all reads can be served from the row cache). If you need the computing power of thousands of nodes, have them talk to the tens of Cassandra nodes storing your data rather than trying to split Cassandra to run across thousands of nodes.
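As a rough illustration of that layout, here is a sketch of the compute nodes acting as plain clients of a small dedicated Cassandra cluster, using the Python cassandra-driver and CQL. It assumes a reasonably modern Cassandra (the answer's era predates CQL); hostnames, keyspace, table, and column names are made up. The memory figures above would be tuned separately, e.g. row_cache_size_in_mb in cassandra.yaml and the JVM heap in cassandra-env.sh.

    # Sketch: compute nodes as clients of a small Cassandra cluster
    # (requires the DataStax "cassandra-driver" package; all names are
    # illustrative only).
    from cassandra.cluster import Cluster

    cluster = Cluster(["cass1.example.com", "cass2.example.com", "cass3.example.com"])
    session = cluster.connect()

    # Replication factor 3, as suggested above.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS sim
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # Serve reads from the row cache and skip the key cache; the row cache
    # itself is sized via row_cache_size_in_mb in cassandra.yaml.
    session.execute("""
        CREATE TABLE IF NOT EXISTS sim.node_state (
            node_id int PRIMARY KEY,
            state   blob
        ) WITH caching = {'keys': 'NONE', 'rows_per_partition': 'ALL'}
    """)

    write = session.prepare("INSERT INTO sim.node_state (node_id, state) VALUES (?, ?)")
    read = session.prepare("SELECT state FROM sim.node_state WHERE node_id = ?")

    session.execute(write, (42, b"result-of-expensive-calculation"))
    row = session.execute(read, (42,)).one()
    print(row.state)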
I've not used CouchDB myself, but I'm told that Couch will run in as little as 256 MB with around 500K records. At a guess, that would mean each of your nodes might need ~512 MB, taking into account the extra 128 MB they need for their calculations. Ultimately you should download both and test each inside a VPS, but it does sound like Couch will run in less memory than Cassandra.
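If you want to sanity-check that figure yourself, CouchDB speaks plain HTTP, so a memory smoke test is easy to script. The sketch below (Python with the requests library; database name and document count are arbitrary, and recent CouchDB versions will need credentials) bulk-loads ~500K tiny documents into a local instance so you can watch the beam.smp process's memory while it runs.

    # Rough CouchDB memory smoke test: bulk-load ~500K small documents and
    # observe the beam.smp process. Assumes a local instance on the default
    # port; adjust the URL and add credentials as needed.
    import requests

    COUCH = "http://localhost:5984"
    DB = "memtest"

    requests.put(f"{COUCH}/{DB}")  # create the database (fails harmlessly if it exists)

    batch = 1000
    for start in range(0, 500_000, batch):
        docs = [{"_id": f"doc-{i}", "value": i} for i in range(start, start + batch)]
        r = requests.post(f"{COUCH}/{DB}/_bulk_docs", json={"docs": docs})
        r.raise_for_status()

    print(requests.get(f"{COUCH}/{DB}").json())  # doc_count, sizes, etc.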
Okay, after doing some more reading after posting the question, and trying some things out, I decided to go with MongoDB.
So far I'm happy. I have very little load, and MongoDB is using very little in the way of system resources (~200 MB at most). However, my data set isn't nearly as large as described in the question, and I am only running one node, so this doesn't mean much.
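For reference, the single-node usage amounts to very little code. Here is a minimal sketch with pymongo; the database, collection, and field names are illustrative, not the poster's actual schema.

    # Minimal single-node MongoDB usage sketch (requires the pymongo
    # package; database/collection/field names are illustrative only).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    state = client.sim.node_state  # database "sim", collection "node_state"

    # Each compute node upserts its small piece of shared state...
    state.update_one(
        {"_id": 42},
        {"$set": {"result": "value-produced-by-the-calculation"}},
        upsert=True,
    )

    # ...and external queries are answered with simple reads keyed on _id.
    print(state.find_one({"_id": 42}))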
CouchDB doesn't seem to support sharding out of the box, so it is not (it turns out) a good fit for the problem described in the question (I know there are add-ons for sharding).