Data Storage in Cassandra
I am currently struggling to find the correct data format to use with Cassandra. I guess this is because of the additional depth it offers over standard key-value stores.
My data format is currently defined like this:
- Keyspaces for different Applications.
- Column Families for different Application parts.
- In these Column Families I have the data.
Most of the data is stored within a single Column Family in the format:
Key: UUID-1|UUID-2|UUID-3
Value: Array of PHP Values
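As a rough sketch of the layout described above (illustrative only; the actual client code is PHP, and the function names and JSON encoding here are my own assumptions, not the poster's implementation):

```python
import json
import uuid

def make_key(app_id, part_id, item_id):
    # Join three UUIDs with "|" to form the composite row key,
    # matching the "Key: UUID-1|UUID-2|UUID-3" layout above.
    return "|".join(str(u) for u in (app_id, part_id, item_id))

def encode_value(values):
    # The array of PHP values is stored as a single serialized blob;
    # JSON is used here purely for illustration.
    return json.dumps(values)

key = make_key(uuid.uuid4(), uuid.uuid4(), uuid.uuid4())
blob = encode_value({"name": "example", "count": 3})
```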
After inserting several hundred thousand entries (<1 KB each) I see a performance degradation when reading data.
From my understanding, Column Families are exactly where the main part of my data should be stored. Keeping most of my data in a single Column Family instead of spreading it across several should not matter.
Should I look into splitting my data into different Column Families or is the approach correct but something else likely to be the reason for the problem?
Edit to answer DNA's questions in the comments:
I am comparing the read time for a single key that I inserted before starting my tests.
At the start, while the database was still empty, the test key consistently read in <0.0010 s across >1,000 reads. The data written in the tests is structured like this:
- A row identified by a key built from 5 characters + 20 digits
- with one column (1 character) containing the current Unix timestamp
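A test row of that shape could be generated like this (a sketch, assuming the 5 chars are letters; the helper names are hypothetical):

```python
import random
import string
import time

def make_test_key():
    # 5 random letters followed by 20 random digits,
    # matching the "5 chars + 20 numbers" key layout of the test rows.
    letters = "".join(random.choices(string.ascii_uppercase, k=5))
    digits = "".join(random.choices(string.digits, k=20))
    return letters + digits

def make_test_row():
    # A single column whose value is the current Unix timestamp.
    return make_test_key(), {"t": int(time.time())}

key, row = make_test_row()
```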
I then added entries and re-ran the same read test to compare the read times. The read times listed here are the lowest observed:
Entries   | Read Time (s)
0         | 0.0010
150,000   | 0.0013
300,000   | 0.0014
500,000   | 0.0016
750,000   | 0.0019
1,000,000 | 0.0022
Because this is only basic testing, it runs on a single node (an EC2 instance) on Amazon. The read time seems to increase by about 0.0003 s for every 250,000 new rows.
I know these numbers are really small and look great, but the linear growth in read time is not what I expected.
I am planning to move a big MySQL server with a huge number of small entries to Cassandra. It currently holds about 75 billion entries and is collecting new datasets very quickly, so a linear increase in read time makes me wonder whether I am heading in the right direction.
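To make the concern concrete, here is a naive linear extrapolation of the measured single-node trend to the full dataset. This is purely illustrative: a real Cassandra cluster shards data across nodes, so this worst case would not actually hold.

```python
# Naive linear extrapolation of the measured trend (single node,
# illustration only - not how a sharded cluster would behave).
base = 0.0010             # read time with an empty database (s)
slope = 0.0003 / 250_000  # observed increase per row (s)
entries = 75_000_000_000  # current size of the MySQL dataset
projected = base + slope * entries  # ≈ 90 seconds per read
```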
Thanks for updating the question.
You should probably read this article about Netflix's Cassandra benchmarking.
Benchmarking with relatively small numbers of rows won't tell you anything about the scalability for large datasets. It's not difficult to run this kind of test for many millions of rows.
If you are just testing at the moment, you should probably upgrade to the 1.0 branch (currently 1.0.7) as this is significantly faster than 0.7.
Performance on cloud servers may not be very representative of the performance on real local hardware - although cloud servers are a great idea for cluster testing. See http://wiki.apache.org/cassandra/CassandraHardware
If read latency is your key concern, then make sure you are familiar with the cache settings in Cassandra (keys_cached and rows_cached) - see http://wiki.apache.org/cassandra/StorageConfiguration, for example.
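In the 0.7/1.0-era cassandra-cli, those cache settings were per-column-family attributes; the column family name and values below are illustrative, so check the StorageConfiguration page linked above for the exact syntax of your version:

```
update column family MyData with keys_cached = 200000 and rows_cached = 10000;
```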