Cassandra cluster load becomes unbalanced after successive stress writes


I was able to recreate a simpler scenario; see the update near the bottom.

First some background on the problem. I'm doing some Cassandra experiments on Amazon EC2. I've got 4 nodes in East and 4 in West in one cluster. To simulate my use case, I used Cassandra's internal stress tool running on a separate East-EC2 instance to issue:

./stress -d us-eastnode1,...,us-eastnode4 --replication-strategy NetworkTopologyStrategy --strategy-properties us-east:3,us-west:3 -e LOCAL_QUORUM -c 200 -i 10 -n 1000000

Next I ran the same write, while also starting a corresponding LOCAL_QUORUM read on another separate West-EC2 instance:

./stress -d us-westnode1,...,us-westnode4 -o read -e LOCAL_QUORUM -c 200 -i 10 -n 1000000

After the first 300k or so reads, one of the West nodes started blocking at ~80% iowait CPU, dropping the total read throughput by ~90%. Meanwhile the writes finished just fine at close to their normal speed. To figure out what was causing this single node to block on iowait, I started up just the reader and hit the same issue immediately.
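
To see whether the slow node is buried in pending reads or compactions, a few standard checks can be run against it. A minimal sketch, with us-westnode2 standing in for whichever node is blocking:

# On the affected node itself: confirm the disk is the bottleneck
iostat -x 5

# From any host with nodetool access: look for backed-up stages and pending compactions
nodetool -h us-westnode2 tpstats
nodetool -h us-westnode2 compactionstats

# Per-CF view: SSTable count, read latency, pending tasks
nodetool -h us-westnode2 cfstats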

My tokens are assigned so that the ring is balanced across the East nodes, with each West node offset by +1 from its corresponding East node, i.e. us-eastnode1: 0, us-westnode1: 1, us-eastnode2: 42535295865117307932921825928971026432, etc. The actual load ended up balanced across the set, so I ruled that out as a possible cause.
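
For reference, these tokens are just the RandomPartitioner range (0 to 2^127) split evenly across four nodes, with each West node taking its East counterpart's token + 1. A quick sketch of the arithmetic:

# Balanced RandomPartitioner tokens for 4 nodes: i * (2^127 / 4)
for i in 0 1 2 3; do echo "$i * (2^127 / 4)" | bc; done
# -> 0
# -> 42535295865117307932921825928971026432
# -> 85070591730234615865843651857942052864
# -> 127605887595351923798765477786913079296

Ring balance and per-node data load can then be double-checked with nodetool -h us-eastnode1 ring.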

I eventually ran a major compaction (despite there being only 10 SSTables for the CF, and no minor compaction having kicked off for over an hour). Once I tried the stress read again, the node was fine... but then the next node in sequence had the same problem. This is the biggest clue I have found, but I don't know where it leads.
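
For anyone retracing this, the manual major compaction and the SSTable-count check can be done roughly as below. This is a sketch that assumes the stress tool's default keyspace and column family names (Keyspace1/Standard1); adjust if yours differ:

# SSTable count for the stress CF before compacting
nodetool -h us-westnode2 cfstats | grep -E 'Column Family|SSTable count'

# Force a major compaction of the stress CF on the affected node
nodetool -h us-westnode2 compact Keyspace1 Standard1

# Watch it run, then re-check the SSTable count
nodetool -h us-westnode2 compactionstats
nodetool -h us-westnode2 cfstats | grep -E 'Column Family|SSTable count'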

I've asked in the Cassandra IRC channel but got no ideas there. Does anybody have ideas for new things I could try to figure out what is going wrong here?

Next day update
With some further delving I was able to reproduce this by simply running the write stress twice, then running the read. nodetool cfstats after the first write shows each node responsible for ~750k keys, which makes sense for 1,000,000 keys at RF:3 across 4 nodes in a DC. However, after the second stress write, us-westnode1 has ~1,500,000 keys while the other three West nodes each have ~875,000 keys. When the read then starts, the node with twice as much load as it should have bogs down.
This makes me think the trouble is in the stress tool. It is overwriting the same 0000000-0999999 rows with the same c0-c199 columns, yet somehow none of the nodes stay at roughly the same data load they had after the first run (see the cfstats sketch below).
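
The per-node key estimates come from nodetool cfstats; a loop like the one below is one way to collect them across the DC after each run (a sketch, assuming the stress defaults Keyspace1/Standard1 and the hostnames used earlier):

# Compare per-node key estimates after each stress write
for h in us-westnode1 us-westnode2 us-westnode3 us-westnode4; do
    echo "== $h =="
    nodetool -h "$h" cfstats | grep -E 'Column Family|Number of Keys'
done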

Simple recreation
I narrowed the problem down by removing the second DC as a variable. Now running 1 DC, 4 nodes with 25% ownership each, RandomPartitioner, and the following write:

./stress -d node1,...,node4 --replication-factor 3 -e QUORUM -c 200 -i 10 -n 1000000

After one write (and minor compactions), each node had ~7.5 GB of load.
After two writes (and minor compactions), each node had ~8.6 GB of load, save for node2 with ~15 GB.
After running a major compaction on all nodes, each node was back to ~7.5 GB of load.
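
The per-node load figures and the final major compaction can be reproduced with a couple of nodetool loops. A minimal sketch, using the node1..node4 placeholders from the write command above:

# Data load per node after each stress write
for h in node1 node2 node3 node4; do
    echo -n "$h "
    nodetool -h "$h" info | grep Load
done

# Major compaction on all nodes, then re-check the load
for h in node1 node2 node3 node4; do
    nodetool -h "$h" compact
done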

Is this simply a weird compaction issue that crops up when effectively overwriting the entire dataset like the stress tool does?

Comments (1)

顾挽 2025-01-11 20:34:57

Is this simply a weird compaction issue that crops up when effectively overwriting the entire dataset like the stress tool does?

Yes, compaction bucketing is going to behave somewhat randomly, and it's normal for some nodes to not compact as well as others. (That said, node2 having essentially no compaction done sounds like it was probably just behind.)

If your actual workload also involves a lot of overwrites, you should probably test Leveled Compaction, which is designed to do a better and more predictable job in that scenario: http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
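
If you want to try that on the stress schema, here is a minimal sketch of switching the CF to Leveled Compaction with cassandra-cli, assuming the stress defaults Keyspace1/Standard1 (on a CQL3 setup the equivalent is ALTER TABLE ... WITH compaction):

cat > enable_lcs.txt <<'EOF'
use Keyspace1;
update column family Standard1 with compaction_strategy = 'LeveledCompactionStrategy';
EOF
cassandra-cli -h node1 -f enable_lcs.txt

After switching, re-running the write/read cycle should show whether per-node load stays even under the same overwrite-heavy pattern.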
