Physical disk space management in Cassandra
Recently I have been looking into Cassandra from our new project's perspective and have learned a lot from this community and its wiki. But I have not found anything about how updates are managed in Cassandra in terms of physical disk space, though it seems to be very similar to how record deletes are managed using compaction.
Suppose there are 100 records with 5 column values each. When all changes are flushed to disk, all records are written adjacently. When a delete is done, it is first marked in the Memtable, and the record is physically deleted some time later, as set in the configuration, or when it is full. The compaction process then reclaims the space.
Now the question is this: on one side, being schema-less, there is no fixed number of columns at the beginning. On the other side, when compaction takes place, does it put records adjacently on disk like a traditional RDBMS to speed up reads? For an RDBMS this is easy, because it allocates a fixed amount of space according to the declared column datatypes.
So how exactly does Cassandra place records on disk during compaction (for both updates and deletes) to speed up reads?
One more question related to compaction: when there are no delete queries, but there is an update query that updates an existing record with some variable-length data, or inserts an entirely new column, how does compaction make space available on disk between the already existing data rows?
Rows and columns are stored in sorted order in an SSTable. This allows a compaction of multiple SSTables to output a new (sorted) SSTable using only sequential disk IO. The new SSTable is written to a new file in free space on the disks. This process doesn't depend on the number of rows or columns, just on them being stored in sorted order. So yes, in all SSTables (even those resulting from compactions) rows and columns are arranged in sorted order on disk.
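To illustrate why sorted inputs make this cheap, here is a minimal sketch in Python. It is not Cassandra's code: the tuple layout `(row_key, column, value)` and the name `merge_sorted_runs` are made up for the example, and each in-memory list stands in for one SSTable file. Each input is read front to back exactly once, which is what makes the disk access sequential.

```python
import heapq

def merge_sorted_runs(*runs):
    """Merge already-sorted runs of (row_key, column, value) into one sorted list.

    heapq.merge only ever looks at the head of each run, so every input
    is consumed strictly front to back (a sequential scan per input).
    """
    return list(heapq.merge(*runs, key=lambda cell: (cell[0], cell[1])))

# Two hypothetical SSTables, each already sorted by (row_key, column).
sstable_a = [("row1", "age", 30), ("row3", "name", "carol")]
sstable_b = [("row2", "name", "bob"), ("row4", "name", "dave")]

merged = merge_sorted_runs(sstable_a, sstable_b)
for cell in merged:
    print(cell)
```

The output is itself sorted, so it can be fed into a later compaction in the same way.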
What's more, as you hint at in your question, updates are no different from inserts: they do not overwrite the value on disk, but instead get buffered in a Memtable and then flushed into a new SSTable. When the new SSTable eventually gets compacted with the SSTable containing the original value, the newer value annihilates the old one, i.e. the old value is not written to the compaction output. Timestamps are used to decide which value is newest.
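A rough sketch of that timestamp rule, again in Python and again only an illustration: the cell layout `(row_key, column, timestamp, value)` and the function name `compact` are assumptions for the example, not Cassandra's on-disk format or API. The inputs are merged in sorted order, and for each `(row_key, column)` pair only the version with the highest timestamp is kept.

```python
import heapq
from itertools import groupby

def cell_id(cell):
    # A cell is identified by (row_key, column); the timestamp orders its versions.
    return (cell[0], cell[1])

def compact(*sstables):
    """Merge sorted SSTables, keeping only the newest version of each cell."""
    merged = heapq.merge(*sstables, key=cell_id)
    output = []
    for _, versions in groupby(merged, key=cell_id):
        # The highest timestamp wins; older versions are simply not written out.
        output.append(max(versions, key=lambda cell: cell[2]))
    return output

# Older SSTable holding the original values.
old_sstable = [
    ("user1", "email", 100, "a@x.org"),
    ("user1", "name", 100, "Ann"),
]
# Newer SSTable produced by flushing an update with a longer value.
new_sstable = [
    ("user1", "email", 200, "ann.long.address@example.org"),
]

compacted = compact(old_sstable, new_sstable)
for cell in compacted:
    print(cell)
```

Note that the updated email is longer than the original. Nothing had to be moved to make room for it, because the update was never written into the old file in the first place.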
Deletes are handled in the same fashion, effectively inserting an "anti-value", or tombstone. The limitation of this process is that it can require significant space overhead. Deletes are effectively "lazy", so the space doesn't get freed until some time later. Also, since the output of a compaction can be as large as its input, and the old SSTables cannot be deleted until the new one is complete, a compaction can temporarily need as much free space as the data it is compacting. In the worst case this limits usable disk capacity to about 50%.
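Here is a simplified sketch of the lazy delete, in Python. The names `TOMBSTONE`, `compact_with_tombstones`, `now` and `gc_grace` are made up for the example, and for brevity it collects cells in a dict rather than streaming a sorted merge. It assumes every SSTable that could hold a shadowed value takes part in the compaction, which a real system has to check before it drops a tombstone.

```python
TOMBSTONE = object()  # hypothetical marker meaning "this cell was deleted"

def compact_with_tombstones(sstables, now, gc_grace):
    """Keep the newest version per key; drop tombstones older than gc_grace.

    Each SSTable is a list of (key, timestamp, value) cells.
    """
    newest = {}
    for sstable in sstables:
        for key, timestamp, value in sstable:
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)

    output = []
    for key in sorted(newest):  # the output stays in sorted key order
        timestamp, value = newest[key]
        if value is TOMBSTONE and now - timestamp >= gc_grace:
            continue  # grace period over: the delete finally frees its space
        output.append((key, timestamp, value))
    return output

old_sstable = [("k1", 1, "a"), ("k2", 1, "b")]
new_sstable = [("k1", 5, TOMBSTONE)]  # k1 was deleted at time 5

# Shortly after the delete, the tombstone is kept but the old value is gone.
early = compact_with_tombstones([old_sstable, new_sstable], now=6, gc_grace=10)
# Once the grace period has passed, the tombstone itself is dropped too.
late = compact_with_tombstones([old_sstable, new_sstable], now=20, gc_grace=10)
print(len(early), len(late))
```

In both runs the old value "a" disappears from the output, but the key only vanishes completely in the second run, which is the "some time later" part.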
In the system described above, a new value for an existing key can be a different size from the existing value without padding to some predetermined length, because on update the new value is not written over the old value but into a new SSTable.