Cassandra insert performance
Sorry in advance for my English.
I am a beginner with Cassandra and its data model. I am trying to insert one million rows into a local Cassandra database on a single node. Each row has 10 columns, and I insert them into a single column family.
With one thread, that operation took around 3 minutes. I would like to do the same with 2 million rows while keeping a good time, so I tried inserting 2 million rows with 2 threads, expecting a similar result of around 3-4 minutes. Instead it took about 7 minutes, roughly twice the first result. From what I have read on various forums, multithreading is recommended to improve performance.
That is why I am asking: is it useful to use multithreading to insert data into a local node (client and server on the same computer), into only one column family?
Some information:
- I use pycassa
- I have put the commit log directory and the data directory on different disks
- I use batch insert in each thread
- Consistency level: ONE
- Replication factor: 1
Comments (4)
It's possible you're hitting the Python GIL, but more likely you're doing something wrong.
For instance, putting 2M rows in a single batch would be Doing It Wrong.
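To illustrate the point about batch size, here is a minimal sketch of splitting the rows into modest batches instead of sending one huge batch. The helper names (`chunked`, `insert_in_batches`), the `send_batch` callback, and the default of 100 rows per batch are assumptions made for this example, not something from the question. With pycassa, `send_batch` could plausibly be `ColumnFamily.batch_insert`, which takes a dict mapping row keys to column dicts, but that wiring is untested here.

```python
from itertools import islice


def chunked(rows, batch_size):
    """Yield successive lists of at most batch_size items from rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(rows, send_batch, batch_size=100):
    """Send rows in small batches; return the number of batches sent.

    rows       -- iterable of (row_key, columns_dict) pairs
    send_batch -- hypothetical callback that receives a dict of
                  {row_key: columns_dict}; assumed, not a fixed API
    """
    sent = 0
    for batch in chunked(rows, batch_size):
        send_batch(dict(batch))
        sent += 1
    return sent
```

The batch size is worth tuning by measurement; 100 is only a starting guess.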
Try running multiple clients in multiple processes, NOT threads.
Then experiment with different insert sizes.
1M inserts in 3 minutes is about 5,500 inserts/sec, which is pretty good for a single local client. On a multi-core machine you should be able to get several times this amount, provided that you use multiple clients, probably inserting small batches of rows or individual rows.
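A rough sketch of the "multiple processes, not threads" idea, using only the standard library. The function names (`split_range`, `worker`, `run`) are made up for this example, and the insert itself is left as a placeholder: in a real run each process would open its own Cassandra connection inside `worker` and insert its share of the keys.

```python
import multiprocessing


def split_range(total, parts):
    """Split range(total) into `parts` contiguous (start, stop) pairs."""
    base, extra = divmod(total, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def worker(bounds):
    """Insert the keys in [start, stop); return how many were handled."""
    start, stop = bounds
    # Assumption: a real worker would create its own client connection
    # here, since connections should not be shared across processes.
    inserted = 0
    for _key in range(start, stop):
        # Placeholder for the actual insert call.
        inserted += 1
    return inserted


def run(total, processes):
    """Spread `total` inserts over `processes` worker processes."""
    with multiprocessing.Pool(processes) as pool:
        return sum(pool.map(worker, split_range(total, processes)))


if __name__ == "__main__":
    print(run(1000, 4))
```

Because each worker is a separate process, the GIL of one client does not block the others, which is the reason for preferring processes over threads here.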
You might consider Redis. Its single-node throughput is supposed to be faster. It's different from Cassandra though, so whether or not it's an appropriate option would depend on your use case.
The time taken doubled because you inserted twice as much data. Is it possible that you are I/O bound?
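As a quick check of that reading, the throughput of the two runs can be compared directly. The timings are the approximate ones from the question (about 3 minutes for 1M rows, about 7 minutes for 2M rows), so the results are rough; `rows_per_second` is just a helper name for this example.

```python
def rows_per_second(rows, minutes):
    """Average insert rate for a run of `rows` rows taking `minutes`."""
    return rows / (minutes * 60.0)


# Approximate figures taken from the question.
one_thread = rows_per_second(1000000, 3)
two_threads = rows_per_second(2000000, 7)

print(round(one_thread), round(two_threads))
```

The two rates are in the same range, which suggests the second thread added little or no throughput rather than slowing things down.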