Cassandra multiget performance
I've got a Cassandra cluster with a fairly small number of rows (2 million or so, which I would hope is "small" for Cassandra). Each row is keyed on a unique UUID, and each row has about 200 columns (give or take a few). All in all these are pretty small rows, with no binary data or large amounts of text. Just short strings.
I've just finished the initial import into the Cassandra cluster from our old database. I've tuned the hell out of Cassandra on each machine. There were hundreds of millions of writes, but no reads. Now that it's time to USE this thing, I'm finding that read speeds are absolutely dismal. I'm doing a multiget using pycassa on anywhere from 500 to 10,000 rows at a time. Even at 500 rows, the performance is awful, sometimes taking 30+ seconds.
What would cause this type of behavior? What sort of things would you recommend after a large import like this? Thanks.
Sounds like you are I/O-bottlenecked. Cassandra does about 4,000 reads/s per core, IF your data fits in RAM. Otherwise you will be seek-bound, just like anything else.
I'd note that normally "tuning the hell" out of a system is something you save for AFTER you start putting load on it. :)
Is it an option to split the multiget up into smaller chunks? Doing so would let you spread the gets across multiple nodes, potentially increasing performance both by distributing the load across nodes and by giving each node smaller packets to deserialize.
That brings me to my next question: what is your read consistency level set to? In addition to the I/O bottleneck @jbellis mentioned, you could also have a network traffic issue if you require a particularly high level of consistency.
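One way to do this is to batch the keys client-side before each call to `multiget`. Below is a minimal sketch of that idea; the keyspace and column family names are placeholders, and the pycassa calls are shown commented out since they require a live cluster:

```python
def chunked(keys, size):
    """Yield successive batches of at most `size` keys from `keys`."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

# Hypothetical pycassa usage (placeholder keyspace/CF names, assumes the
# standard ConnectionPool / ColumnFamily / multiget API):
#
# import pycassa
# pool = pycassa.ConnectionPool('MyKeyspace')
# cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
#
# rows = {}
# for batch in chunked(all_keys, 100):   # 100 keys per request, tune to taste
#     rows.update(cf.multiget(batch))

# The helper itself is plain Python:
batches = list(chunked(list(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Smaller batches also mean the coordinator node waits on fewer replicas per request, which can smooth out the long-tail latencies you're seeing at 500+ rows.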