How to organize row keys for range scans in Cassandra

Posted 2025-01-01 16:29:18


I am trying to find a good way to organize my row keys so that I can perform range scans on them without maintaining my own index lists.

I have a MySQL server that currently holds about 15,000 databases, each with ~50 tables = 75,000 tables. Because 99% of the data is always read by a unique identifier, the plan is to move that data into a Cassandra cluster.

For some maintenance tasks (listing the contents of a complete table, removing a complete table, or dropping a database) I need to read the contents of a complete table or even a whole database. Range scans seem to be the perfect fit for that.

Currently I am planning to generate a UUID for each part of the old structure and join them with a | separator (DB + Table + Id = UUID1|UUID2|UUID3).

Example:

07424eaa-4761-11e1-ac67-12313c033ac4|0619a6ec-4525-11e1-906e-12313c033ac4|0619a6ec-4795-12e9-906e-78313c033ac4

The column family (CF) holding the data would be sorted with org.apache.cassandra.db.marshal.AsciiType.

As the client I am using phpcassa.

For the range scans I want to use UUID| as the start key and, as the end of the range, the same key with chr(255) or z appended to it. The character value of both is greater than that of any UUID character that can follow in those keys.

Is this a solid approach that allows me to achieve the explained goals for the range scans?
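
The key scheme described in the question can be sketched in Python against a plain sorted list, which stands in for a store that keeps row keys in byte order. This is only an illustration under that assumption; `OTHER_DB`, `prefix_bounds` and `scan` are hypothetical names, and no Cassandra or phpcassa call is involved.

```python
from bisect import bisect_left, bisect_right

DB = "07424eaa-4761-11e1-ac67-12313c033ac4"
OTHER_DB = "1b2c3d4e-4761-11e1-ac67-12313c033ac4"  # hypothetical second database
TABLE = "0619a6ec-4525-11e1-906e-12313c033ac4"


def prefix_bounds(prefix):
    """Start and end keys for all keys of the form '<prefix>|...'."""
    # 'z' (0x7A) sorts after every hex digit and after '-', so any
    # UUID that follows the separator falls below the end key.
    return prefix + "|", prefix + "|z"


def scan(sorted_keys, prefix):
    """Return the keys between the two bounds, assuming byte-ordered keys."""
    start, end = prefix_bounds(prefix)
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


# Keys as they would sit in a byte-ordered store.
keys = sorted([
    f"{DB}|{TABLE}|0619a6ec-4795-12e9-906e-78313c033ac4",
    f"{DB}|{TABLE}|ffffffff-4795-12e9-906e-78313c033ac4",
    f"{OTHER_DB}|{TABLE}|0619a6ec-4795-12e9-906e-78313c033ac4",
])

for key in scan(keys, DB):
    print(key)
```

The same bounds work one level down: passing `DB + "|" + TABLE` as the prefix selects a single table. Whether a real cluster returns keys in this order depends on the partitioner in use.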


Answer from 醉酒的小男人, posted 2025-01-08 16:29:18

Cassandra best practice is to use the RandomPartitioner: it gives you 'free' load balancing, as long as your tokens are evenly distributed. Unfortunately, with the random partitioner, row range queries (i.e. get_range_slices) return keys in token order, which is effectively random with respect to the keys themselves.

This is fine for paging through the entire column family (and if that is what you want, then your approach will work). But if you just want to page through a smaller, contiguous range of row keys, it will not work.
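
To see why a key prefix stops being useful under a hashing partitioner, here is a small Python sketch. It assumes the token is the absolute value of the key's MD5 digest read as a signed integer, which is my understanding of how RandomPartitioner derives tokens; the short keys are made up for readability.

```python
import hashlib


def token(key):
    """Approximate RandomPartitioner token for a row key (assumption:
    absolute value of the MD5 digest interpreted as a signed integer)."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


# Hypothetical keys: five rows of one table plus one row of another database.
keys = [f"db1|table1|{i:04d}".encode() for i in range(5)] + [b"db2|table1|0000"]

byte_order = sorted(keys)               # order an order-preserving scan would use
token_order = sorted(keys, key=token)   # order a token-based range scan walks

for k in token_order:
    print(f"{token(k):039d}  {k.decode()}")
```

Printing both orders side by side shows that rows sharing a prefix have unrelated tokens, so a start and end key no longer bracket them.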

One option to solve this is to use wide rows and composite columns. For example, a column family which looks like this:

{ 
  row1 -> {column1: value1, column2: value2},
  row2 -> {column3: value3, column4: value4},
  ... 
}

Would be transposed to look like this:

{
  row1-10 -> {
              [row1, column1]: value1, [row1, column2]: value2,
              [row2, column3]: value3, [row2, column4]: value4,
              ...
             }
  ...
}

You can then run a range query as a column slice (get_slice) on the appropriate row, between the appropriate columns. That is,

get_range_slices(start=row1, end=row2)

becomes:

get_slice(row=row1-10, start=[row1, null], end=[row2, null])

Note the null second dimension on the column keys.
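
A minimal in-memory model of that column slice, written in Python. Tuples stand in for composite column names, and a one-element tuple plays the role of the "null second dimension" because it sorts before every longer tuple with the same first element. The `get_slice` here is a local helper, not the Thrift call, and it treats the end bound as exclusive, which is an assumption of this sketch.

```python
from bisect import bisect_left

# One wide row ("bucket") whose column names are
# (original_row, original_column) pairs, kept in sorted order.
wide_row = sorted([
    (("row1", "column1"), "value1"),
    (("row1", "column2"), "value2"),
    (("row2", "column3"), "value3"),
    (("row2", "column4"), "value4"),
    (("row3", "column5"), "value5"),
])
names = [name for name, _ in wide_row]


def get_slice(start, end):
    """Return the columns whose name satisfies start <= name < end."""
    lo = bisect_left(names, start)
    hi = bisect_left(names, end)
    return wide_row[lo:hi]


# ("row1",) sorts before ("row1", "column1"), so the slice begins
# at the first column that belonged to row1.
print(get_slice(("row1",), ("row2",)))
```

To include the columns of row2 as well, pass an end bound that sorts after them, for example `("row3",)` in this data set.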

The trick is to pick your row ('bucket') keys so that no single row grows too wide (very wide rows perform badly in stock Cassandra), while your queries still don't need to fetch too many rows. This will depend on your average query size and the distribution of your UUIDs, but a good candidate might be to use UUID1 as the row key and [UUID2, UUID3] as the leading dimensions of the column keys.
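
As a rough illustration of that last suggestion, the Python sketch below buckets data by database: UUID1 becomes the row key and (UUID2, UUID3) the composite column name. `BucketedStore` and `split_key` are hypothetical names, a dict stands in for the column family, and the prefix filter stands in for a column slice over one table's columns.

```python
from collections import defaultdict


def split_key(flat_key):
    """Split 'UUID1|UUID2|UUID3' into (row_key, composite_column_name)."""
    db, table, item = flat_key.split("|")
    return db, (table, item)


class BucketedStore:
    """In-memory stand-in for a column family with one wide row per database."""

    def __init__(self):
        self.rows = defaultdict(dict)  # db -> {(table, item): value}

    def put(self, flat_key, value):
        db, column = split_key(flat_key)
        self.rows[db][column] = value

    def list_table(self, db, table):
        """All items of one table: a slice over columns that start with `table`."""
        row = self.rows.get(db, {})
        return sorted((c, v) for c, v in row.items() if c[0] == table)

    def list_database(self, db):
        """All items of one database: the whole wide row, in column order."""
        return sorted(self.rows.get(db, {}).items())


store = BucketedStore()
store.put("db-a|table-1|item-1", "x")  # short made-up IDs instead of UUIDs
store.put("db-a|table-1|item-2", "y")
store.put("db-a|table-2|item-1", "z")
store.put("db-b|table-1|item-1", "w")

print(store.list_table("db-a", "table-1"))
print(store.list_database("db-a"))
```

With this layout, dropping a database maps to removing one row and dropping a table maps to removing a contiguous run of columns, at the cost of one potentially very wide row per database.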
