Counting columns: CountQuery vs. SliceQuery very slow
I've written a "census" program to iterate through all the rows in a Column Family and, within each row, count the columns, recording the maximum column count and the row key it occurs in. I've been spending more time with the Hector client, but have written a Pelops client as well to test.
The basic flow is to use a RangeSlicesQuery to iterate through the rows, and then at each row, use a SliceQuery to iterate through the columns and collect the stats. It works similarly in Pelops, just with different APIs. The downside is having to do the buffering manually, picking buffer sizes for both rows and columns... My current data is 12 million rows, with the largest column count ~25K, so yeah, it takes a while... in my current configuration I'm getting >25K rows per second.
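In case it helps, here's a stripped-down version of my Hector loop. This is a sketch rather than my exact code: the host, cluster name, keyspace, column family ("MyCF"), and buffer sizes below are all placeholders.

```java
import java.util.List;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.RangeSlicesQuery;
import me.prettyprint.hector.api.query.SliceQuery;

public class Census {
    private static final StringSerializer SS = StringSerializer.get();
    private static final int ROW_BUFFER = 1000;  // placeholder buffer sizes
    private static final int COL_BUFFER = 1000;

    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("census", "localhost:9160");
        Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);

        long maxCount = 0;
        String maxKey = null;
        String startKey = "";

        while (true) {
            // Page through the rows; only the keys are needed here, since the
            // columns are fetched per row below.
            RangeSlicesQuery<String, String, String> rq =
                    HFactory.createRangeSlicesQuery(ks, SS, SS, SS)
                            .setColumnFamily("MyCF")
                            .setKeys(startKey, "")
                            .setRowCount(ROW_BUFFER)
                            .setReturnKeysOnly();
            OrderedRows<String, String, String> rows = rq.execute().get();

            for (Row<String, String, String> row : rows) {
                String key = row.getKey();
                // Start keys are inclusive: every page after the first repeats
                // the last key of the previous page, so skip it.
                if (!startKey.isEmpty() && key.equals(startKey)) {
                    continue;
                }
                long count = countColumns(ks, key);
                if (count > maxCount) {
                    maxCount = count;
                    maxKey = key;
                }
            }

            if (rows.getCount() < ROW_BUFFER) {
                break;  // a short page means we've reached the end
            }
            startKey = rows.peekLast().getKey();
        }
        System.out.printf("max columns: %d in row %s%n", maxCount, maxKey);
    }

    // Count the columns in one row by paging a SliceQuery, COL_BUFFER at a time.
    private static long countColumns(Keyspace ks, String key) {
        long count = 0;
        String startCol = "";
        while (true) {
            SliceQuery<String, String, String> sq =
                    HFactory.createSliceQuery(ks, SS, SS, SS)
                            .setColumnFamily("MyCF")
                            .setKey(key)
                            .setRange(startCol, "", false, COL_BUFFER);
            List<HColumn<String, String>> cols = sq.execute().get().getColumns();

            int n = cols.size();
            if (!startCol.isEmpty() && n > 0) {
                n--;  // first column repeats the last one of the previous page
            }
            count += n;
            if (cols.size() < COL_BUFFER) {
                return count;
            }
            startCol = cols.get(cols.size() - 1).getName();
        }
    }
}
```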
Looking for ways to improve, I discovered Hector's CountQuery (which I assume uses the Thrift client's get_count()). Thinking it would be faster to just iterate keys (using RangeSlicesQuery.setReturnKeysOnly()) and then re-use a CountQuery on each row key, I revised the code.
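The revised version drops the per-row SliceQuery paging and instead counts each row in one call, roughly like this (again a sketch, with the same placeholder names as above):

```java
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.CountQuery;

public final class ColumnCounter {
    private static final StringSerializer SS = StringSerializer.get();

    // One round trip per row: the server walks the row and returns just the count.
    static int countColumns(Keyspace ks, String key) {
        CountQuery<String, String> cq = HFactory.createCountQuery(ks, SS, SS);
        cq.setColumnFamily("MyCF");                  // placeholder CF, as above
        cq.setKey(key);
        cq.setRange(null, null, Integer.MAX_VALUE);  // count the whole row
        return cq.execute().get();
    }
}
```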
Not only was it slower, but 30x slower! (processed only 900 rows per second)...
Is there a better way to count columns?
1 Answer
Not sure what's going on with Hector -- I'd expect it to be roughly 2x slower, not 30x slower.
More generally, keeping a denormalized count using a counter column is probably better than a full CF scan: http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
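A rough sketch of what that looks like with Hector, assuming a counter column family named "Counts" created with default_validation_class: CounterColumnType (all the names here are made up, and counters require Cassandra 0.8+):

```java
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.CounterQuery;

public final class DenormalizedCounts {
    private static final StringSerializer SS = StringSerializer.get();

    // Bump the per-row count whenever a column is inserted into the data CF.
    static void incrementCount(Keyspace ks, String rowKey) {
        Mutator<String> m = HFactory.createMutator(ks, SS);
        m.insertCounter(rowKey, "Counts",
                HFactory.createCounterColumn("columnCount", 1L, SS));
    }

    // Reading the count back is a single-column fetch instead of a row scan.
    static long readCount(Keyspace ks, String rowKey) {
        CounterQuery<String, String> q = HFactory.createCounterColumnQuery(ks, SS, SS);
        q.setColumnFamily("Counts");
        q.setKey(rowKey);
        q.setName("columnCount");
        HCounterColumn<String> col = q.execute().get();
        return col == null ? 0L : col.getValue();
    }
}
```

The trade-off is that you maintain the count at write time (and counters are not idempotent on retry), but reads become O(1) instead of a scan over the whole row or CF.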