Tips on a Cassandra data model for monitoring data
I'm relatively new to Cassandra and have to evaluate different NoSQL solutions for a monitoring tool.
One datum is only about 100 bytes, but there are really a lot of them: we get about 15 million records a day...
So I'm currently testing with 900 million records (about 15 GB as a SQL insert script).
My first question is:
Does Cassandra fit my needs? I need to do range queries (on the date the records were created) and sum up some of the columns according to groups defined by "secondary indexes" stored in the datum.
I already tried MongoDB, but its MapReduce did a really poor job...
I also read about HBase, but the enormous amount of configuration it needs makes me hope there could be a solution with Cassandra...
A second question is: how could I store my data to access it in the ways mentioned above?
I already thought of a super column family where the key is the date (as a long since 1970) and the columns would be the datums taken at that time... but if I use the Random Partitioner, I can't do fast range queries on it (as far as I know), and if I use the Order Preserving Partitioner, the data won't be spread over my cluster (currently consisting of two nodes).
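For illustration, a minimal sketch of this layout using the Thrift-era pycassa client ('MonitoringKS', 'Metrics', and the comparator are assumptions, nothing is fixed yet). The usual workaround for the partitioner tradeoff is to keep the Random Partitioner and put the day in the row key, so a date range becomes a computable list of row keys plus a column slice per row instead of a key-range scan:

    # Sketch only: 'MonitoringKS' / 'Metrics' are made-up names; the CF is
    # assumed to use a LongType comparator so columns sort by timestamp.
    import pycassa

    pool = pycassa.ConnectionPool('MonitoringKS', server_list=['localhost:9160'])
    metrics = pycassa.ColumnFamily(pool, 'Metrics')

    DAY = 86400  # one row per day; the row key is the day's start (epoch seconds)

    def store(ts, datum_bytes):
        # column name = timestamp in ms, column value = the ~100-byte datum
        metrics.insert(str(int(ts) // DAY * DAY), {int(ts * 1000): datum_bytes})

    def query_range(start, end):
        # enumerate the day keys covering [start, end], then slice columns
        days = [str(d) for d in range(int(start) // DAY * DAY, int(end) + 1, DAY)]
        return metrics.multiget(days,
                                column_start=int(start * 1000),
                                column_finish=int(end * 1000))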
I hope I gave you all the necessary information...
Thank you for your help!
andy
Comments (2)
Sounds like a job for Brisk (Cassandra + Hadoop distribution). Full Hadoop map/reduce including Hive support, virtually no configuration required.
http://www.datastax.com/products/brisk
We had a similar situation.
We store our data in simple rows, where the row key is in the form <id>:<time-bucket>. Our current time-bucket size is 24h. The column name is the timestamp, and the value is a small object serialized with msgpack. We do aggregation manually if needed.
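For illustration, a minimal sketch of this write path (pycassa client; 'MonitoringKS' and 'Metrics' are made-up names, and the column family is assumed to use a LongType comparator):

    import msgpack
    import pycassa

    pool = pycassa.ConnectionPool('MonitoringKS')
    metrics = pycassa.ColumnFamily(pool, 'Metrics')

    BUCKET = 24 * 3600  # 24h buckets, as described above

    def store(source_id, ts, datum):
        key = '%s:%d' % (source_id, int(ts) // BUCKET * BUCKET)  # <id>:<time-bucket>
        # column name = timestamp (ms), value = msgpack-serialized object
        metrics.insert(key, {int(ts * 1000): msgpack.packb(datum)})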
We also do a small optimization: when the bucket is full, it becomes immutable, so we create an "all" object holding all values in a single column. Then the per-timestamp columns can be purged. This allows us to fetch a whole bucket and deserialize it in O(1) rather than scanning through the row.
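A sketch of that sealing step under the same assumptions; writing the packed series to a separate 'MetricsSealed' column family is one way (among others) to realize the single "all" column:

    import msgpack
    import pycassa

    pool = pycassa.ConnectionPool('MonitoringKS')
    metrics = pycassa.ColumnFamily(pool, 'Metrics')
    sealed = pycassa.ColumnFamily(pool, 'MetricsSealed')  # hypothetical target CF

    def seal(key):
        # the bucket is immutable now: read every per-timestamp column,
        # pack the whole series into one value, then purge the old row
        cols = metrics.get(key, column_count=10 ** 7)
        series = [(t, msgpack.unpackb(v)) for t, v in cols.items()]
        sealed.insert(key, {'all': msgpack.packb(series)})  # one column, one fetch
        metrics.remove(key)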