Evenly partitioning non-uniformly distributed range data in Cassandra

Posted 2024-10-24 04:00:52

I've got a rather tricky one; bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a Cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows held in memory @ 16G -- it was the only way to query the data fast enough). The data itself is kinda-sorta static. There's a whole lot of it, but any new data is a somewhat slow trickle at this point.

The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". Then the database returns that many scores per classifier. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score -- e.g. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the values -- which by the way range from -10000 to 10000).
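To pin down what that query is expected to return, here is a minimal in-memory sketch in plain Python. It says nothing about how Cassandra would store or serve the data; the function name `top_n_per_classifier` and the sample rows are made up for illustration.

```python
import heapq
from collections import defaultdict

def top_n_per_classifier(rows, classifier_ids, n):
    """Return the n highest-scoring (classifier_id, score) rows per requested classifier."""
    wanted = set(classifier_ids)
    by_classifier = defaultdict(list)
    for classifier_id, score in rows:
        if classifier_id in wanted:
            by_classifier[classifier_id].append(score)

    result = []
    for classifier_id in classifier_ids:
        # nlargest returns the scores from highest to lowest
        for score in heapq.nlargest(n, by_classifier[classifier_id]):
            result.append((classifier_id, score))
    return result

# Hypothetical sample rows: (classifier ID, score)
rows = [(4, 9100), (4, 8700), (4, -200), (6, 400), (6, 9900), (6, 350)]
print(top_n_per_classifier(rows, [4, 6], 2))
# [(4, 9100), (4, 8700), (6, 9900), (6, 400)]
```

Asking for the top 2 of 2 classifiers gives back 4 rows, in the same way that the top 500 of 2 classifiers gives back 1000.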

As we transition to Cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this; however, like I said, the scores tend to clump at the extremes (which would put too much of a burden on one node). So my first question is: how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
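The hot-spot concern can be illustrated with a deliberately simplified model, assuming that placement is decided by the score alone and that each node owns an equal-width slice of the -10000 to 10000 range. Real token assignment works on row keys and is more involved; `node_for_score`, `load_per_node`, and the clumped sample data are all hypothetical.

```python
from collections import Counter

SCORE_MIN, SCORE_MAX = -10000, 10000

def node_for_score(score, num_nodes):
    """Map a score to a node by cutting the score range into equal-width slices."""
    width = (SCORE_MAX - SCORE_MIN + 1) / num_nodes
    return min(int((score - SCORE_MIN) / width), num_nodes - 1)

def load_per_node(scores, num_nodes):
    """Count how many scores land on each node."""
    counts = Counter(node_for_score(s, num_nodes) for s in scores)
    return [counts.get(node, 0) for node in range(num_nodes)]

# Made-up clumped data: 90 scores near the top, 10 spread over the rest of the range
scores = [9000 + i for i in range(90)] + [-10000 + 1900 * i for i in range(10)]
print(load_per_node(scores, 4))
# [3, 3, 2, 92]
```

With order-preserving placement, the node that owns the top slice ends up holding nearly all of the rows, and it is also the node every top-N query hits.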

There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask: show me the 500 scores that are closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that Cassandra supports secondary indexes (yay) but only hash type (boo - no ranges). Do we create a separate ColumnFamily for this use case?
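For reference, the nearness query is straightforward once one classifier's scores are available in sorted order, which is exactly what a hash-only index does not provide. The sketch below is plain Python over an in-memory list, not a Cassandra query; `nearest_scores` and the sample data are invented for illustration, and it assumes the target score itself counts as one of the results.

```python
import bisect

def nearest_scores(sorted_scores, target, k):
    """Return up to k scores closest to target from an ascending list."""
    right = bisect.bisect_left(sorted_scores, target)  # first score >= target
    left = right - 1                                   # last score < target
    picked = []
    while len(picked) < k and (left >= 0 or right < len(sorted_scores)):
        # Take from the left side when the right side is exhausted,
        # or when the left candidate is at least as close to the target.
        take_left = right >= len(sorted_scores) or (
            left >= 0
            and target - sorted_scores[left] <= sorted_scores[right] - target
        )
        if take_left:
            picked.append(sorted_scores[left])
            left -= 1
        else:
            picked.append(sorted_scores[right])
            right += 1
    return sorted(picked)

# Hypothetical scores for classifier 6, already in ascending order
classifier_6 = [-900, 120, 350, 400, 410, 455, 9900]
print(nearest_scores(classifier_6, 400, 4))
# [350, 400, 410, 455]
```

One binary search finds the starting position, and the result is then a contiguous run on either side of it, so any storage layout that keeps a classifier's scores ordered could answer this with a pair of range reads.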

And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.

We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also, all of the data WOULD be evenly distributed using a random partitioner. However, since MOST of our queries are interested only in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
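A rough sketch of the bucket scheme for a single classifier, in plain Python, shows both properties: the top 500 is a single lookup, while a nearness query first has to binary-search the buckets. The helper names (`build_buckets`, `find_bucket`) and the evenly spaced sample scores are assumptions made for the example; in a real store each probe would be a separate read.

```python
BUCKET_SIZE = 500

def build_buckets(scores, bucket_size=BUCKET_SIZE):
    """Rank scores from highest to lowest and cut them into fixed-size buckets.

    Bucket 1 holds the top bucket_size scores, bucket 2 the next, and so on.
    """
    ranked = sorted(scores, reverse=True)
    return {
        i // bucket_size + 1: ranked[i:i + bucket_size]
        for i in range(0, len(ranked), bucket_size)
    }

def find_bucket(buckets, target):
    """Binary-search bucket numbers for the bucket whose score range covers target.

    Returns (bucket_number, probes), where probes counts the buckets inspected.
    """
    lo, hi = 1, len(buckets)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if buckets[mid][-1] <= target:  # lowest score in bucket mid
            hi = mid
        else:
            lo = mid + 1
    return lo, probes

# Made-up data: 2000 evenly spaced scores, which gives 4 buckets of 500
buckets = build_buckets(range(-10000, 10000, 10))
top_500 = buckets[1]                # the common query touches bucket 1 only
print(len(buckets), top_500[0], top_500[-1])
# 4 9990 5000
print(find_bucket(buckets, 400))
# (2, 2)
```

The number of probes grows with the logarithm of the bucket count, and the closest 500 scores can still straddle two neighbouring buckets once the starting bucket is found.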

At this point we're running low on ideas. Everything I've seen about Cassandra makes me wonder if it's even appropriate for this task. We chose it mainly because of its horizontal scalability, which is important (much easier to add a node than to shard an RDBMS). So I suppose my overall question is: how would you approach this? If Cassandra, please address any of the above issues. Otherwise any insight or wisdom would be appreciated. Thanks.
