Cassandra File System
According to the Brisk implementation [presentation at Cassandra SF], Cassandra, CFS, the Job/Task Trackers and the Hive Metastore all run in a single JVM, which is quite different from configuring an independent Hadoop cluster.
Is this an advantage?
What happens if the Task Tracker or any other individual process in the JVM fails? Will that affect the Cassandra instance in the same JVM?
Where does CFS get its data from? Does it store the SSTables as sub-blocks, or a copy of them? Where is the compression of the sub-blocks done?
Regards,
Tamil
Comments (1)
Brisk does run all of it in a single JVM, but in separate, independent threads that don't affect one another. The trackers run on a dedicated node, but there is no single point of failure: any node can be elected to run the trackers, and all of the state is persisted to the Cassandra cluster.
The advantage of it all being in the same JVM is that there's no copy or serialization overhead for moving data from Cassandra into the Hadoop code.
CassandraFS breaks the 64MB HDFS blocks into 2MB chunks and stores them as columns in Cassandra, with one row per block. The files themselves are mapped to a list of block row UUIDs in the inodes column family.
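To make that layout a bit more concrete, here is a rough in-memory sketch in Python. It is not Brisk's actual code or schema: the ToyCFS class, the two dicts standing in for column families, and the use of an integer index as the column name are all assumptions made for illustration. Only the sizes (64MB blocks, 2MB chunks), the one-row-per-block layout and the inode-to-block-UUID mapping come from the description above.

```python
import uuid

BLOCK_SIZE = 64 * 1024 * 1024      # 64MB HDFS-style block, as described above
SUB_BLOCK_SIZE = 2 * 1024 * 1024   # 2MB chunk, stored as one column


def split(data, size):
    """Yield consecutive slices of `data`, each at most `size` bytes long."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


class ToyCFS:
    """Hypothetical in-memory model; plain dicts stand in for column families."""

    def __init__(self, block_size=BLOCK_SIZE, sub_block_size=SUB_BLOCK_SIZE):
        self.block_size = block_size
        self.sub_block_size = sub_block_size
        self.inodes = {}   # file path -> ordered list of block row UUIDs
        self.blocks = {}   # block row UUID -> {column index: chunk bytes}

    def write(self, path, data):
        block_ids = []
        for block in split(data, self.block_size):
            block_id = uuid.uuid4()
            # One row per block; each chunk of the block becomes one column.
            # The integer column name is an assumption of this sketch.
            self.blocks[block_id] = dict(enumerate(split(block, self.sub_block_size)))
            block_ids.append(block_id)
        self.inodes[path] = block_ids

    def read(self, path):
        out = bytearray()
        for block_id in self.inodes[path]:
            row = self.blocks[block_id]
            for index in sorted(row):
                out += row[index]
        return bytes(out)
```

With the real sizes, a full block row holds 64MB / 2MB = 32 columns. For a quick experiment you can shrink the sizes, e.g. ToyCFS(block_size=8, sub_block_size=2), write a few bytes, and inspect fs.inodes and fs.blocks to see the file-to-blocks and block-to-chunks mappings.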