Best data store for billions of indexes
So we're looking to store two kinds of indexes.
- The first kind will number in the billions, each with between 1 and 1000 values, each value being one or two 64-bit integers.
- The second kind will number in the millions, each with about 200 values, each value between 1KB and 1MB in size.
And our usage pattern will be something like this:
- Both kinds of index will have values added to the top up to thousands of times per second.
- Indexes will be read infrequently, but when they are read, the entire index will be read.
- Indexes should be pruned, either on writing values to the index or in some kind of batch type job
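To make the access pattern above concrete, here is a minimal in-memory sketch in Python. It is only an illustration of "add to the top, read everything, prune on write", not a proposal for the actual store. The class name `IndexStore`, the `user:42` key, and the cap of 1000 values are assumptions made for the example.

```python
from collections import defaultdict, deque

MAX_VALUES = 1000  # assumed cap per index, taken from the "1 to 1000 values" figure


class IndexStore:
    """In-memory stand-in for the access pattern, not a real data store."""

    def __init__(self, max_values=MAX_VALUES):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._indexes = defaultdict(lambda: deque(maxlen=max_values))

    def add(self, index_id, value):
        # New values go on top (the left end); pruning happens on write.
        self._indexes[index_id].appendleft(value)

    def read_all(self, index_id):
        # Reads always return the whole index, newest value first.
        return list(self._indexes.get(index_id, ()))


store = IndexStore(max_values=3)
for v in [(1, 10), (2, 20), (3, 30), (4, 40)]:
    store.add("user:42", v)
print(store.read_all("user:42"))  # [(4, 40), (3, 30), (2, 20)]
```

Pruning on write keeps each index bounded without a separate job; a batch job would instead let indexes grow past the cap between runs.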
Now, we've considered quite a few databases; our favourites at the moment are Cassandra and PostgreSQL. However, our application is in Erlang, which has no production-ready bindings for Cassandra. A major requirement is that it can't require too much manpower to maintain. I get the feeling that Cassandra's going to throw up unexpected scaling issues, whereas PostgreSQL's just going to be a pain to shard, but at least for us it's a known quantity. We're already familiar with PostgreSQL, but not hugely well acquainted with Cassandra.
So. Any suggestions or recommendations as to which data store would be most appropriate to our use case? I'm open to any and all suggestions!
Thanks,
-Alec
Comments (2)
You haven't given enough information to support much of an answer re: your index design. However, Cassandra scales up quite easily by growing the cluster.
You might want to read this article: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
A more significant issue for Cassandra is whether it supports the kind of queries you need - scalability won't be the problem. From the numbers you give, it sounds like we are talking about terabytes or tens of terabytes, which is very safe territory for Cassandra.
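As a rough check on the size estimate, the arithmetic can be sketched in a few lines of Python. The counts are assumptions: "billions" and "millions" are taken as 1e9 and 1e6, and the figures cover raw payload only, ignoring keys, replication, and storage overhead.

```python
KB = 1024
MB = 1024 * KB
TB = 10**12  # decimal terabytes

# Assumed counts: "billions" and "millions" taken as 1e9 and 1e6.
SMALL_INDEXES = 10**9
LARGE_INDEXES = 10**6


def raw_bytes(indexes, values_per_index, bytes_per_value):
    """Raw payload size, ignoring keys, replication, and storage overhead."""
    return indexes * values_per_index * bytes_per_value


small_low = raw_bytes(SMALL_INDEXES, 1, 8)        # one 64-bit integer per index
small_high = raw_bytes(SMALL_INDEXES, 1000, 16)   # 1000 pairs of 64-bit integers
large_low = raw_bytes(LARGE_INDEXES, 200, 1 * KB)
large_high = raw_bytes(LARGE_INDEXES, 200, 1 * MB)

for label, low, high in [
    ("small indexes", small_low, small_high),
    ("large indexes", large_low, large_high),
]:
    print(f"{label}: {low / TB:.3f} TB to {high / TB:.1f} TB")
```

The range is wide, so the real average number of values and average value size matter far more than the index counts when sizing the cluster.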
Billions is not a big number by today's standards. Why not write a benchmark instead of guessing? That will give you a better decision tool, and it's really easy to do. Just install your target OS and each database engine, then run queries with, let's say, Perl (because I like it).
It won't take you more than a day to do all this; I've done something like this before.
A nice way to benchmark is to write a script that executes queries randomly, or following something like a Gaussian bell curve, "simulating" real usage. Then plot the data, or do it like a boss and just read the logs.
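A minimal sketch of such a script, written here in Python rather than Perl, might look like the following. Everything in it is an assumption for illustration: the key-space size, the operation count, the 5% read fraction, and the in-memory `store`, which stands in for the database under test. The `write` and `read` functions are the places where real queries against each engine would go.

```python
import random
import statistics
import time
from collections import defaultdict

NUM_KEYS = 10_000      # hypothetical key space for the test run
NUM_OPS = 50_000       # hypothetical number of simulated operations
READ_FRACTION = 0.05   # assumed: reads are rare compared with writes

store = defaultdict(list)  # stand-in for the database under test


def pick_key(rng):
    # Gaussian around the middle of the key space, clamped to valid keys,
    # so a band of "hot" indexes gets most of the traffic.
    k = int(rng.gauss(NUM_KEYS / 2, NUM_KEYS / 8))
    return min(max(k, 0), NUM_KEYS - 1)


def write(key, value):
    store[key].insert(0, value)  # replace with an insert against the real engine


def read(key):
    return list(store[key])      # replace with a full-index read


def run(seed=1):
    rng = random.Random(seed)
    latencies = {"read": [], "write": []}
    for i in range(NUM_OPS):
        key = pick_key(rng)
        op = "read" if rng.random() < READ_FRACTION else "write"
        start = time.perf_counter()
        if op == "read":
            read(key)
        else:
            write(key, i)
        latencies[op].append(time.perf_counter() - start)
    return latencies


results = run()
for op, samples in results.items():
    print(op, len(samples), f"median={statistics.median(samples) * 1e6:.1f}us")
```

Fixing the seed makes runs comparable across engines; the per-operation latencies can be dumped to a log or plotted afterwards.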