What is a good bulk data loading tool for Cassandra?
I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
3 Answers
1) If you already have all the data to be loaded in place, you can try the sstableloader utility (only for Cassandra 0.8.x onwards) to bulk load it. For more details see: cassandra bulk loader
2) In its latest release line, cassandra-1.1.x onwards, Cassandra has introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job.
For more details see: Bulkloading to Cassandra with Hadoop
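As a rough sketch of the sstableloader workflow from point 1: the loader is pointed at a local directory of pre-built SSTable files. The staging path and keyspace name below (`/tmp/bulk/Keyspace1`) are assumptions for illustration, and the loader command itself is only printed, not executed, since it needs a running cluster.

```python
import os

# Sketch only: sstableloader streams pre-built SSTable (.db) files from a
# local directory into a running cluster. In the 0.8.x-1.1.x era the
# directory name is taken as the keyspace ("Keyspace1" here is an example).
STAGING = "/tmp/bulk/Keyspace1"

def stage_dir(path):
    """Create the staging directory the loader will be pointed at."""
    os.makedirs(path, exist_ok=True)
    return path

staged = stage_dir(STAGING)
# Place the Keyspace1-<ColumnFamily>-*-Data.db / -Index.db files in `staged`,
# then run, from the Cassandra installation directory:
print("bin/sstableloader " + staged)
```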
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic CSV -> Cassandra mapping may not be appropriate since secondary indexes and denormalisation are commonly needed.
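To make the point above concrete, here is a minimal sketch of the kind of hand-written, query-driven mapping a generic tool cannot infer. The CSV columns (`user_id`, `song`, `ts`) and the two column families are invented for illustration; the second one is the denormalized inverse you maintain yourself in place of a secondary index.

```python
import csv
import io

# Hypothetical input: a CSV of song-play events (names are assumptions).
SAMPLE_CSV = """user_id,song,ts
alice,hey-jude,1000
alice,let-it-be,1001
bob,hey-jude,1002
"""

def map_csv_to_column_families(csv_text):
    """One CSV, two query-driven 'column families' (modeled as dicts).

    events_by_user answers "what did this user play?", while
    users_by_song is the denormalized inverse answering "who played
    this song?" -- the manual secondary-index work the answer refers to.
    """
    events_by_user = {}  # row key: user_id, columns keyed by timestamp
    users_by_song = {}   # row key: song, columns keyed by user_id
    for row in csv.DictReader(io.StringIO(csv_text)):
        events_by_user.setdefault(row["user_id"], {})[row["ts"]] = row["song"]
        users_by_song.setdefault(row["song"], {})[row["user_id"]] = row["ts"]
    return events_by_user, users_by_song

events, inverse = map_csv_to_column_families(SAMPLE_CSV)
print(events["alice"])           # columns for row key "alice"
print(sorted(inverse["hey-jude"]))
```

The same three CSV columns produce two differently keyed structures, which is exactly why the schema has to be designed around the queries rather than generated from the file.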
For Cassandra 1.1.3 and higher, there is the CQL COPY command, which is available for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing roughly fewer than 2 million rows, this is a good option. It is much easier to use than sstableloader and less error prone. The sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
For larger data sets, you should use sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.
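A small sketch of the COPY path for such smaller imports: COPY itself is run from cqlsh, so the code below only prepares the delimited file it consumes and prints the command you would paste into cqlsh. The keyspace, table, and column names (`music.playlists`, `user_id`, `song`, `ts`) are assumptions for illustration.

```python
import csv
import io

# Example rows; the embedded comma in the second row is quoted by the
# csv module, which COPY's default CSV parsing handles.
ROWS = [
    ("alice", "hey-jude", 1000),
    ("bob", "let it be, naturally", 1001),
]

def write_copy_input(rows):
    """Render rows as the comma-delimited text that CQL COPY accepts."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def copy_command(table, columns, path):
    """Build the COPY ... FROM statement to run inside cqlsh."""
    return "COPY {} ({}) FROM '{}';".format(table, ", ".join(columns), path)

print(write_copy_input(ROWS), end="")
print(copy_command("music.playlists", ["user_id", "song", "ts"],
                   "/tmp/playlists.csv"))
```

Compared with sstableloader's strictly formatted .db files, this is the whole preparation step: a plain delimited file plus one cqlsh statement.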