Cassandra/BigTable data model - what is the best way to build an index?
I'm in the process of spiking a conversion from MySQL to Cassandra for PenWag.com. In Cassandra, I'm storing Users keyed off of a GUID, but users sign in with their email, not the GUID (obviously). GUID as a key for Users makes sense to me more than email for two reasons. From a practical perspective it seems that it's too cumbersome to change or delete/add a row with all of its SuperColumns. From a theoretical standpoint, it's still the same user, why should their key change?
Nevertheless, here's my question: I'm building an index in a separate ColumnFamily, mapping email->GUID to support login. It's a Standard type CF, where the column name is the email and the value is the GUID. It's Standard, not Super, to avoid loading an entire SC for every mapping. Supporting "change email" is easy; it's just a column delete/add. But it seems that an alternative is to store the index as rows instead of columns, where the row key is the email and a column holds the GUID. Delete/add on those rows would not be cumbersome, since there's only one column (the GUID) to manage.
It seems that either approach works. What are the pros and cons of each? Is there a best practice?
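To make the two layouts concrete, here is a minimal sketch that models a column family as a plain Python dict of row key -> {column name: value}. It does not talk to Cassandra at all; the class names and the `email_to_guid` row key are hypothetical and only illustrate the shape of the data and what "change email" costs in each case.

```python
# Toy model only: a column family is a dict of row_key -> {column_name: value}.
# Class names and the "email_to_guid" row key are made up for illustration.

class WideRowIndex:
    """Layout 1: one row; each column name is an email, its value the GUID."""
    ROW_KEY = "email_to_guid"

    def __init__(self):
        self.cf = {self.ROW_KEY: {}}

    def add(self, email, guid):
        self.cf[self.ROW_KEY][email] = guid

    def lookup(self, email):
        return self.cf[self.ROW_KEY].get(email)

    def change_email(self, old, new):
        row = self.cf[self.ROW_KEY]
        row[new] = row.pop(old)  # one column delete plus one column add


class RowPerEmailIndex:
    """Layout 2: one row per email; a single 'guid' column holds the GUID."""

    def __init__(self):
        self.cf = {}

    def add(self, email, guid):
        self.cf[email] = {"guid": guid}

    def lookup(self, email):
        row = self.cf.get(email)
        return row["guid"] if row else None

    def change_email(self, old, new):
        guid = self.cf.pop(old)["guid"]  # delete the old one-column row
        self.cf[new] = {"guid": guid}    # add the new one-column row
```

Both classes expose the same `add` / `lookup` / `change_email` operations, so the difference between the two approaches is purely where the email lives: as a column name inside one shared row, or as the row key itself.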
Since I have no hands-on experience with Cassandra or similar databases, you'll need to take my answer with a grain of salt :)
If you stored each mapping as a column, using the email address as the column name, this would imply a single row containing an enormous number of columns. According to Wikipedia[1]:
This could result in significant locking overhead if all mappings are stored in a single row.
The Cassandra Wiki states[2]:
This makes me believe that it's more efficient to do lookups based on row key than on column name. Based on this information, I would suggest using the email address as the row key and storing the GUID in a column.
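One more way to see the difference, sketched under a simplifying assumption: Cassandra places rows on nodes according to the row key, so a single giant row lives in one place while many small rows spread out. The snippet below is not Cassandra's actual partitioner (the real ring assigns token ranges to nodes); `NODES`, `node_for`, and the `email_to_guid` row key are hypothetical names used only to illustrate the effect.

```python
import hashlib

NODES = 4  # hypothetical cluster size


def node_for(row_key: str) -> int:
    """Map a row key to a node by hashing it (simplified stand-in for a ring)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES


emails = [f"user{i}@example.com" for i in range(100)]

# Layout 1: every login reads the same row, whatever the email is.
wide_row_nodes = {node_for("email_to_guid") for _ in emails}

# Layout 2: each login reads the row keyed by that user's email.
per_email_nodes = {node_for(email) for email in emails}

print("nodes serving the single wide row:", sorted(wide_row_nodes))
print("nodes serving one-row-per-email:  ", sorted(per_email_nodes))
```

In this toy model every login against the wide row lands on the same node, while the row-per-email layout distributes logins across the cluster.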
Niels is correct; one row per user would be the right way to do this manually.
I qualify that because in 0.7 you could just have an email column in the row with the rest of your keyed-by-UUID user data and ask Cassandra to index it: http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes
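Conceptually, a built-in secondary index means the store keeps the email -> key mapping up to date for you on every write. The sketch below is a toy in-memory model of that idea, not Cassandra's API; `UserStore`, `put`, and `get_by_email` are hypothetical names, and it assumes each email belongs to at most one user.

```python
import uuid


class UserStore:
    """Toy store: rows keyed by UUID, with an automatically maintained email index."""

    def __init__(self):
        self.users = {}      # user_id -> {column_name: value}
        self._by_email = {}  # email -> user_id, maintained inside put()

    def put(self, user_id, columns):
        old = self.users.get(user_id, {})
        merged = {**old, **columns}
        old_email, new_email = old.get("email"), merged.get("email")
        if old_email != new_email:
            # Keep the index in step with the row; the caller never touches it.
            if old_email is not None:
                self._by_email.pop(old_email, None)
            if new_email is not None:
                self._by_email[new_email] = user_id
        self.users[user_id] = merged

    def get_by_email(self, email):
        user_id = self._by_email.get(email)
        if user_id is None:
            return None
        return user_id, self.users[user_id]


store = UserStore()
uid = uuid.uuid4()
store.put(uid, {"email": "a@example.com", "name": "Ann"})
store.put(uid, {"email": "b@example.com"})  # "change email" is a plain column write
```

The point of the design is that the user row keeps its UUID key forever, and changing the email is an ordinary update to one column, with no separate index column family for the application to manage.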