What Cassandra schema could serve this query?

Posted on 2024-10-10 11:44:36

Assume a social application with a few million users and around 200-300 topics. Users can make posts, and each post can be tagged with up to 5 topics. I have 2 kinds of queries on this data:

  1. find posts by a certain user
  2. find all recent posts tagged with a specific topic.

For the 1st query I can easily create the schema using SuperColumns in the User column family (in this SuperColumn, I can store the postIds of all posts by the user as columns).

My question is: how should I design the schema to serve the 2nd query in Cassandra?

Comments (2)

深陷 2024-10-17 11:44:36

Although Justice's answer would work, I don't like it because it requires an OrderPreservingPartitioner to perform the range scan. OPP has a lot of problems associated with it. See the article that I've been linking to constantly for details.

Instead, I would recommend this:

topic|YYMMDDHH: {TimeUUID: postID, TimeUUID: postID, etc... }

where "topic|YYMMDDHH" is the row key, each column name is a TimeUUID, and the column values are postIDs.

To get the latest posts for any topic, you take a slice off the end of the most recent row for that topic. If that row does not have enough columns, you move to the previous hour's row, and so on.

This has a few nice properties. First, if you don't care about really old posts on a topic, only relatively recent ones, you can purge old rows on a regular basis and save yourself some space; this could even be done with column TTLs so that you don't have to do any extra work. Second, your rows will be bounded in size because they are split every hour. Third, you don't need OPP :)

One downside to this is that if there's a really hot topic, one node may receive higher traffic than the others for an hour at a time.
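
To make the lookup concrete, here is a minimal in-memory sketch of the bucketed-row idea. It is not Cassandra client code: the `rows` dict stands in for the column family, a `(timestamp, counter)` tuple stands in for a TimeUUID column name, and the names `row_key`, `add_post`, `latest_posts` and the `max_hours_back` cutoff are all my own assumptions for illustration.

```python
import itertools
from datetime import datetime, timedelta, timezone

# Stand-in for the column family: row key -> list of (column name, post ID),
# kept sorted by column name, as a TimeUUID comparator would order columns.
rows = {}
_tiebreak = itertools.count()  # keeps column names unique within one timestamp


def row_key(topic, when):
    """Build the 'topic|YYMMDDHH' row key for the hour containing `when`."""
    return f"{topic}|{when:%y%m%d%H}"


def add_post(topic, post_id, when):
    """Add one column to the hourly row for this topic."""
    column_name = (when, next(_tiebreak))  # hypothetical TimeUUID substitute
    columns = rows.setdefault(row_key(topic, when), [])
    columns.append((column_name, post_id))
    columns.sort(key=lambda column: column[0])


def latest_posts(topic, now, limit, max_hours_back=24):
    """Slice from the end of the newest row, then walk back hour by hour."""
    found = []
    bucket = now
    for _ in range(max_hours_back):
        columns = rows.get(row_key(topic, bucket), [])
        needed = limit - len(found)
        # Reversed slice off the end of the row: newest columns first.
        found.extend(post_id for _, post_id in reversed(columns[-needed:]))
        if len(found) >= limit:
            break
        bucket -= timedelta(hours=1)
    return found


now = datetime(2024, 10, 17, 11, 30, tzinfo=timezone.utc)
add_post("cassandra", "p1", now - timedelta(hours=2, minutes=5))  # 09 bucket
add_post("cassandra", "p2", now - timedelta(minutes=50))          # 10 bucket
add_post("cassandra", "p3", now - timedelta(minutes=20))          # 11 bucket
add_post("cassandra", "p4", now - timedelta(minutes=5))           # 11 bucket
add_post("databases", "p5", now)

print(latest_posts("cassandra", now, 3))  # ['p4', 'p3', 'p2']
```

Every read here is a lookup by exact row key, which is why this layout does not depend on the partitioner keeping keys in order. The cutoff on how far back to walk is a choice I added so that a quiet topic does not trigger an unbounded number of reads.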

皇甫轩 2024-10-17 11:44:36

For the second query, build a secondary-index column family whose keys are #{topic}:#{unix_timestamp}. Rows would have a single column with the post ID. You can then do a range scan.
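
As a rough illustration of that key layout, the sketch below simulates the index in memory. It is not Cassandra client code: the sorted `index_keys` list stands in for the key ordering that an order-preserving partitioner would provide, and the zero-padded timestamp and the names `index_key`, `index_post` and `recent_posts` are my own assumptions.

```python
import bisect

index_keys = []  # row keys kept in sorted order, as OPP would store them
index_rows = {}  # row key -> post ID (the single column in each row)


def index_key(topic, unix_ts):
    """Build a 'topic:timestamp' key; zero-padding (an assumption) keeps
    string order identical to numeric order."""
    return f"{topic}:{unix_ts:010d}"


def index_post(topic, unix_ts, post_id):
    """Write one index row for a post tagged with `topic`."""
    key = index_key(topic, unix_ts)
    if key not in index_rows:
        bisect.insort(index_keys, key)
    index_rows[key] = post_id


def recent_posts(topic, since_ts, until_ts):
    """Range scan over the keys between two timestamps, newest first."""
    lo = bisect.bisect_left(index_keys, index_key(topic, since_ts))
    hi = bisect.bisect_right(index_keys, index_key(topic, until_ts))
    return [index_rows[key] for key in reversed(index_keys[lo:hi])]


index_post("cassandra", 1729165000, "p1")
index_post("cassandra", 1729165600, "p2")
index_post("databases", 1729165300, "p3")

print(recent_posts("cassandra", 1729165000, 1729166000))  # ['p2', 'p1']
```

Note that in this sketch two posts on the same topic in the same second would share a key and the later write would overwrite the earlier one, so a real key would likely need something extra to keep it unique.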
