Compound indexes in Apache Cassandra

I am trying to set up a cassandra column family with secondary indexes on a few columns I will need to filter by when reading data back out. In my initial testing, when I use multiple indexes together, things slow down. Here is how I have it configured currently (via cassandra-cli):

update column family bulkdata with comparator=UTF8Type and column_metadata=[{column_name: test_field, validation_class: UTF8Type}, {column_name: create_date, validation_class: LongType, index_type: KEYS}, {column_name: domain, validation_class: UTF8Type, index_type: KEYS}];

I want to get all data where create_date > somevalue1 and domain = somevalue2. Using pycassa for my client, I do the following:

  from pycassa.index import create_index_expression, create_index_clause, GT

  # col_fam is a pycassa ColumnFamily handle for 'bulkdata'
  domain_expr = create_index_expression('domain', 'whatever.com')    # equality match
  cd_expr = create_index_expression('create_date', 1293650000, GT)   # range match
  clause = create_index_clause([domain_expr, cd_expr], count=10000)
  for key, item in col_fam.get_indexed_slices(clause):
    ...

This is a common mistake in SQL of course, where one would normally have to create a compound index based on the query's needs. I'm quite new to cassandra though, so I don't know if such a thing is required or even exists.

My interactions with cassandra will include large numbers of writes, and large numbers of reads and updates. I have set up the indexes figuring they were the right thing to do here, but perhaps I am completely wrong. I'd be interested in any ideas for setting up a performant system, with my index setup or without.

oh, and this is on cassandra 0.7.0-rc3

Native Cassandra secondary indexes have some limitations. According to the DataStax documentation, they are not supposed to be used on columns with high cardinality (too many unique values), and the create_date column you are indexing on looks like it will have high cardinality. Also, there is no such thing as a compound index in Cassandra's native secondary index support.
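
One alternative is to maintain your own index column family: one wide row per domain whose column names are create_date timestamps, which you can then slice by range instead of filtering through two KEYS indexes. Below is a minimal pycassa sketch of that idea, not a definitive implementation; the bulkdata_by_domain column family, the mykeyspace keyspace, and the row key values are assumptions, and the index CF would need comparator=LongType so the timestamp column names sort and slice correctly.

  from pycassa.pool import ConnectionPool
  from pycassa.columnfamily import ColumnFamily

  pool = ConnectionPool('mykeyspace')                    # hypothetical keyspace name
  bulkdata = ColumnFamily(pool, 'bulkdata')
  by_domain = ColumnFamily(pool, 'bulkdata_by_domain')   # hypothetical index CF, comparator=LongType

  # On write: besides the normal insert into bulkdata, record the row key
  # in the domain's wide row under its create_date.
  by_domain.insert('whatever.com', {1293650000: 'some_bulkdata_row_key'})

  # On read: slice the domain's row for create_date > 1293650000 (column_start
  # is inclusive, so start one past the cutoff), then fetch the matching rows.
  index_cols = by_domain.get('whatever.com', column_start=1293650001, column_count=10000)
  rows = bulkdata.multiget(list(index_cols.values()))

Whether this beats the two KEYS indexes depends on how wide the per-domain rows get; if a single domain accumulates a very large number of columns you would want to bucket the row key, for example by month.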

For more in-depth coverage, you can visit my blog post:
http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/

Pranab
