Cassandra for storing documents
I am currently running a project where we need to annually store 40 billion documents (PDF, TIFF) for roughly 200 million accounts, and I was wondering if it is possible to use Cassandra for that? This is mainly because of the scalability, stability and multi-datacenter support in the Cassandra design.
But I wonder if it is a good idea to use Cassandra for this at all - or would another alternative like CouchDB be a better option?
Just a note: we don't need full-text search within the documents, and each document will only have a limited amount of metadata attached - like date, time, origin, owner and a unique id, plus a few keywords. Access to documents will normally be done through a query on the owner id, and from there the needed document is chosen by origin and optionally date/time. So nothing fancy.
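To make that access pattern concrete, here is roughly how I imagine the data could be laid out, sketched against the DataStax Python driver with CQL; the keyspace, table and column names are just placeholders, and whether the document bytes live inline as a blob or as a reference to external storage is still an open question:

```python
# Rough sketch only - names and layout are illustrative, not a final design.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # placeholder contact point
session = cluster.connect("docstore")   # assumes a keyspace named 'docstore' already exists

# One partition per owner, clustered by origin and date/time, so the usual
# lookup ("all documents for owner X from origin Y, newest first") is a
# single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS documents_by_owner (
        owner_id   bigint,
        origin     text,
        created_at timestamp,
        doc_id     uuid,
        keywords   set<text>,
        content    blob,
        PRIMARY KEY ((owner_id), origin, created_at, doc_id)
    ) WITH CLUSTERING ORDER BY (origin ASC, created_at DESC, doc_id ASC)
""")

# Typical query: pick an owner, narrow by origin, optionally add a date/time range.
rows = session.execute(
    "SELECT doc_id, created_at, keywords FROM documents_by_owner "
    "WHERE owner_id = %s AND origin = %s",
    (1234567, "scan-center-01"),
)
for row in rows:
    print(row.doc_id, row.created_at, row.keywords)
```

Whether a single-table layout like this holds up at our scale (a partition per owner that keeps growing year after year) is exactly the kind of thing I'd like feedback on.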
Thanks for your thoughts on this.
Comments (1)
Just a few thoughts:
You might want to also consider a distributed file system such as HDFS.
40 billion per year works out to roughly 1,270 per second on average - Cassandra can handle this kind of write load, assuming the documents are modestly sized and not all huge multi-megabyte files.
What kind of read load are you anticipating?
Will the documents be preserved forever, i.e. 40 billion added per year indefinitely?
If a document is 100KB (say), that's 4 petabytes per year, I think? I've not heard of a Cassandra cluster that big - it would be worth asking on the Cassandra mailing list (with some realistic figures rather than my guesses!).
I've heard that a Cassandra node can typically manage 1TB under heavy load, maybe 10TB under light load. So that's at least a 400-node cluster for year one, possibly much more, especially if you want replication.
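To make those estimates easy to check, here is the back-of-envelope arithmetic in one place; the 100KB average document size and the 10TB-per-node figure are the guesses from above, and the replication factor of 3 is an extra assumption of mine:

```python
# Back-of-envelope estimate; every input here is a guess, not a measurement.
docs_per_year = 40_000_000_000          # 40 billion documents per year
seconds_per_year = 365 * 24 * 3600      # ~31.5 million seconds

writes_per_sec = docs_per_year / seconds_per_year
print(f"average write rate: {writes_per_sec:,.0f} docs/s")        # ~1,268 docs/s

avg_doc_size = 100 * 1024               # assume 100 KB per document
bytes_per_year = docs_per_year * avg_doc_size
print(f"raw data per year: {bytes_per_year / 1e15:.1f} PB")       # ~4.1 PB

node_capacity = 10e12                   # optimistic ~10 TB usable per node (light load)
replication_factor = 3                  # assumed, a typical Cassandra setting
nodes_year_one = bytes_per_year * replication_factor / node_capacity
print(f"nodes for year one at RF=3: {nodes_year_one:,.0f}")       # ~1,229 nodes
```

So even with optimistic per-node capacity, replication alone pushes the cluster well past the 400-node floor, and it keeps growing for every year the data is retained.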
This page gives some 2009 figures for HDFS capabilities - 14 petabytes (60 million files) using 4000 nodes, plus a lot of other interesting detail (e.g. name nodes needing 60GB of RAM).