Data Storage in Cassandra
I am currently struggling to find the correct data format to use with Cassandra. I guess this is because of the additional depth it offers over standard key-value stores.
My data format is currently defined like this:
- Keyspaces for different Applications.
- Column Families for different Application parts.
- In these Column Families I have the data.
Most of the data is stored within a single Column Family in the format:
Key: UUID-1|UUID-2|UUID-3
Value: Array of PHP Values
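As a rough sketch of the layout described above (illustrative only; the actual client code is PHP, and the function names and JSON encoding here are my own assumptions, not the poster's implementation):

```python
import json
import uuid

def make_key(app_id, part_id, item_id):
    # Join three UUIDs with "|" to form the composite row key,
    # matching the "Key: UUID-1|UUID-2|UUID-3" layout above.
    return "|".join(str(u) for u in (app_id, part_id, item_id))

def encode_value(values):
    # The array of PHP values is stored as a single serialized blob;
    # JSON is used here purely for illustration.
    return json.dumps(values)

key = make_key(uuid.uuid4(), uuid.uuid4(), uuid.uuid4())
blob = encode_value({"name": "example", "count": 3})
```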
After inserting several hundred thousand entries (<1 KB each) I see a performance degradation when reading data.
From my understanding, Column Families are exactly where the main part of my data should be stored. Keeping most of my data in a single Column Family instead of spreading it across several should not matter.
Should I look into splitting my data into different Column Families or is the approach correct but something else likely to be the reason for the problem?
Edit to answer DNA's questions in the comments:
I am comparing the read time for a single key that I inserted before starting my tests.
At the start, while the database was still empty, the test key consistently read in <0.0010 s across >1,000 reads. The data written in the tests is structured like this:
- A row identified by a key built from 5 characters + 20 digits
- with one column (1 character) containing the current Unix timestamp
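A test row of that shape could be generated like this (a sketch, assuming the 5 chars are letters; the helper names are hypothetical):

```python
import random
import string
import time

def make_test_key():
    # 5 random letters followed by 20 random digits,
    # matching the "5 chars + 20 numbers" key layout of the test rows.
    letters = "".join(random.choices(string.ascii_uppercase, k=5))
    digits = "".join(random.choices(string.digits, k=20))
    return letters + digits

def make_test_row():
    # A single column whose value is the current Unix timestamp.
    return make_test_key(), {"t": int(time.time())}

key, row = make_test_row()
```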
I then added entries and re-ran the same read test to compare the read times. The read times listed here are the lowest observed:
Entries   | Read Time (s)
0         | 0.0010
150,000   | 0.0013
300,000   | 0.0014
500,000   | 0.0016
750,000   | 0.0019
1,000,000 | 0.0022
Because this is only basic testing, it runs on a single node (an EC2 instance) on Amazon. The read time seems to increase by about 0.0003 s for every 250,000 new rows.
I know these numbers are really small and look great, but the linear growth in read time is not what I expected.
I am planning to move a big MySQL server with a huge number of small entries to Cassandra. It currently holds about 75 billion entries and is collecting new datasets very quickly, so a linear increase in read time makes me wonder whether I am heading in the right direction.
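To make the concern concrete, here is a naive linear extrapolation of the measured single-node trend to the full dataset. This is purely illustrative: a real Cassandra cluster shards data across nodes, so this worst case would not actually hold.

```python
# Naive linear extrapolation of the measured trend (single node,
# illustration only - not how a sharded cluster would behave).
base = 0.0010             # read time with an empty database (s)
slope = 0.0003 / 250_000  # observed increase per row (s)
entries = 75_000_000_000  # current size of the MySQL dataset
projected = base + slope * entries  # ≈ 90 seconds per read
```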
Thanks for updating the question.
You should probably read this article about Netflix's Cassandra benchmarking.
Benchmarking with relatively small numbers of rows won't tell you anything about the scalability for large datasets. It's not difficult to run this kind of test for many millions of rows.
If you are just testing at the moment, you should probably upgrade to the 1.0 branch (currently 1.0.7) as this is significantly faster than 0.7.
Performance on cloud servers may not be very representative of the performance on real local hardware - although cloud servers are a great idea for cluster testing. See http://wiki.apache.org/cassandra/CassandraHardware
If read latency is your key concern, then make sure you are familiar with the cache settings in Cassandra (keys_cached and rows_cached) - see http://wiki.apache.org/cassandra/StorageConfiguration, for example.
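In the 0.7/1.0-era cassandra-cli, those cache settings were per-column-family attributes; the column family name and values below are illustrative, so check the StorageConfiguration page linked above for the exact syntax of your version:

```
update column family MyData with keys_cached = 200000 and rows_cached = 10000;
```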