Is MySQL Cluster catching up with Cassandra?
I have recently been looking at NoSQL solutions for our fairly large upcoming database. Cassandra looks good, but there are very few resources online about its newer releases; most blogs and articles cover version 0.6, even though it has since added support for Hadoop and Hive. On the other hand, MySQL Cluster is also built specifically to scale horizontally on commodity servers.
We have been used to the relational model for years, and moving to Cassandra would mean rewiring our brains. Meanwhile, the product is still not very mature, and the community is not big enough to respond quickly to a particular problem. I checked the website of DataStax (one of the professional support providers), and their forums are pretty much dead.
So, how do MySQL Cluster and Cassandra compare, setting the relational vs. non-relational debate aside?
Although Cassandra is schemaless, it still provides fairly tabular features such as super columns and subcolumns, so a record can be searched by multiple column values.
I have also tried my best to find out how Cassandra physically stores updates. For example, when a subcolumn of a row is edited and a fairly large chunk of data is added, how does it physically store that record, and how does it access it quickly? In MySQL, columns have a fixed length allocated, so it's not a big issue there.
Here are some areas where I suspect Cassandra has an advantage:
To elaborate a little on the last point: most people who haven't actually run Cassandra on a multi-node cluster don't realize just how well Cassandra has been designed for this. For a two-minute taste, see Jake Luciani's demo.
To answer your physical storage question, the key feature that makes Cassandra writes fast is that they are append-only. That is, Cassandra only ever writes sequential blocks to disk; it doesn't need to do any slow seeks to random disk locations during a write.
When a column is updated, two things happen: the write is appended to the commit log (for failure recovery), and the in-memory Memtable is updated. Once the Memtable is full, it is flushed out to disk as a new SSTable. Thus, the length of the data doesn't matter, since you're not trying to fit it into a fixed-length disk structure.
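The write path described above can be sketched in a few lines of Python. This is only a toy model, not Cassandra's implementation: the class name `TinyStore`, the `memtable_limit` parameter, and the use of in-memory lists in place of real files are all assumptions made for illustration.

```python
class TinyStore:
    """Toy append-only store: commit log + memtable + immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the sequential on-disk log
        self.memtable = {}     # in-memory key -> latest value
        self.sstables = []     # flushed tables, oldest first
        self.memtable_limit = memtable_limit  # hypothetical "full" threshold

    def write(self, key, value):
        # 1. Append to the commit log (sequential, used for recovery).
        self.commit_log.append((key, value))
        # 2. Update the in-memory table; the value's length is irrelevant.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a new key-sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


store = TinyStore(memtable_limit=2)
store.write("a", "x")
store.write("b", "y" * 1000)   # a large value is handled like any other
store.write("a", "z")          # the update goes to a fresh memtable
```

Note that the update to `"a"` never touches the already flushed SSTable; the old value simply stays there until it is cleaned up later.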
SSTables are read-only - you never go back and overwrite an old value on an update, you just write new ones. On a read, Cassandra first looks in the Memtable for the key. If it doesn't find it, Cassandra scans the SSTables in order from newest to oldest and stops when it finds the key. This gives you the most recent value.
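As a rough sketch of that read path, assuming each SSTable is modelled as a plain dict and the list is ordered oldest first (the function name `read` and this data layout are illustrative, not Cassandra's API):

```python
def read(key, memtable, sstables):
    """Return the most recent value for key, or None if it was never written."""
    # The memtable always holds the newest data, so check it first.
    if key in memtable:
        return memtable[key]
    # Otherwise walk the SSTables from newest to oldest and stop at the first hit.
    for table in reversed(sstables):
        if key in table:
            return table[key]
    return None


sstables = [{"a": "old"}, {"a": "new", "b": "1"}]   # oldest first
latest = read("a", {}, sstables)                     # found in the newest SSTable
from_memory = read("a", {"a": "newest"}, sstables)   # memtable takes priority
```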
There are a few optimizations as well. Each SSTable has an associated Bloom filter for its keys, which is a compact probabilistic index that can produce false positives but never false negatives. If the key is not in the Bloom filter, you can safely skip that SSTable as it is guaranteed not to contain the key, although you may occasionally read an SSTable that you didn't have to.
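To make the false-positive/false-negative asymmetry concrete, here is a minimal Bloom filter sketch. The sizes (`num_bits`, `num_hashes`) and the SHA-256-based hashing scheme are arbitrary choices for illustration and say nothing about what Cassandra actually uses.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add("row1")
```

A reader would call `might_contain(key)` before opening an SSTable and skip the table whenever it returns `False`.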
When you get too many SSTables, they are merged together into a bigger one in a process called compaction. Essentially this does a big merge sort on the SSTables. This lets Cassandra reclaim the space for values that have been overwritten or deleted, and defragment rows that were spread across multiple SSTables.
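The merge step can be sketched with Python's `heapq.merge`, which lazily merges already sorted inputs. The sketch assumes each SSTable is a key-sorted list of `(key, value)` pairs, that the list of tables is ordered oldest first, and that `None` serves as a hypothetical deletion marker. It also drops deleted keys immediately, whereas real compaction keeps deletion markers around for a grace period.

```python
import heapq

TOMBSTONE = None  # hypothetical marker for a deleted value


def compact(sstables):
    """Merge key-sorted SSTables (oldest first) into one; the newest value wins."""
    # Tag each entry with the negated table age so that, for equal keys,
    # the entry from the newest table sorts first.
    streams = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    result = []
    last_key = object()  # sentinel that equals no real key
    for key, _neg_age, value in heapq.merge(*streams):
        if key == last_key:
            continue  # an older, overwritten version: its space is reclaimed
        last_key = key
        if value is not TOMBSTONE:
            result.append((key, value))
    return result


old = [("a", "1"), ("b", "2")]
new = [("a", "9"), ("b", TOMBSTONE), ("c", "3")]
merged = compact([old, new])
```

Here `"a"` keeps only its newer value and `"b"` disappears entirely, which is the space reclamation the paragraph above describes.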
See http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/ and http://wiki.apache.org/cassandra/MemtableSSTable for more information.
Disclaimer: I work as part of the MySQL Cluster product team.
If you are looking at Cluster, it would be worth starting with the latest 7.2 Development Release, which includes new capabilities to significantly enhance JOIN performance, as well as a new memcached interface that bypasses the SQL layer: http://dev.mysql.com/tech-resources/articles/mysql-cluster-labs-dev-milestone-release.html
If you are already familiar with MySQL, then the following documentation highlights differences between InnoDB and the current GA 7.1 release:
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndb-innodb-workloads.html
While these don't provide direct comparisons with Cassandra, they do at least provide the latest information on Cluster, on which you can base any comparison.
Another option these days is a relational model in Cassandra with playORM. As long as you partition your really, really big tables, you can do joins and all the stuff you are familiar with using Scalable SQL, like so
NOTE: The TABLE is a Trades table, and p.security references the Security table. Trades is partitioned, so it can have unlimited partitions. The Security table is smaller, so it is not partitioned, but you can still do all the Scalable SQL with joins that you want.