Apache Cassandra schema for the Twitter Streaming API
I am aware of Twissandra, an example Twitter clone built on Cassandra, but I am interested in whether anyone has shared a Cassandra schema designed not to clone Twitter but to store tweets coming in through the Twitter Streaming API.
Comments (1)
It very much depends on what sort of queries you want to run against the data after you have ingested it. I see from your previous question, "Dumping Twitter Streaming API tweets...", that you probably just want to do big batch processing on it.
If this is the case, you just need to worry about load balancing: each node in the cluster should handle 1/n of the write load and hold 1/n of the data. Using the random partitioner and inserting one row per tweet, with the status id as the row key, will achieve this.
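To illustrate why hashed row keys balance the load, here is a rough sketch in plain Python rather than a real Cassandra client. It assumes a simplified ring in which each of n nodes owns an equal slice of a 128-bit MD5 token space; the real partitioner's token range and the function names `token_for`, `node_for` and `distribution` are my own simplifications, not Cassandra's API.

```python
import hashlib
from collections import Counter


def token_for(row_key: str) -> int:
    """Hash a row key to a 128-bit integer token (simplified model of a hashing partitioner)."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


def node_for(row_key: str, node_count: int) -> int:
    """Map the token onto one of node_count equal slices of the 128-bit token space."""
    return token_for(row_key) * node_count // (1 << 128)


def distribution(status_ids, node_count):
    """Count how many rows (one per status id) land on each node."""
    counts = Counter(node_for(str(status_id), node_count) for status_id in status_ids)
    return [counts[node] for node in range(node_count)]


if __name__ == "__main__":
    # Hypothetical, sequential status ids: hashing still spreads them across nodes.
    print(distribution(range(1_000_000_000, 1_000_010_000), 4))
```

Even though the status ids are sequential, each node should end up with roughly a quarter of the rows, which is the 1/n property described above.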
However, if you want to run queries like "give me all tweets for a given user", you will need a slightly more complicated schema, because the schema suggested above would require you to scan all the data. You could insert multiple tweets per row, with the user id as the row key, the tweet id as the column key, and the tweet itself as the value. Then you could use get_slice to answer that query.
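The per-user layout can be modelled roughly as follows. This is an in-memory stand-in, not a Cassandra client: the `UserTimeline` class and its `get_slice` method are hypothetical and only mimic the idea of slicing a row's columns in column-name order, with the sample ids and payloads made up for illustration.

```python
from collections import defaultdict


class UserTimeline:
    """In-memory model of a column family: row key = user id, column name = tweet id."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, user_id, tweet_id, tweet_json):
        # One column per tweet inside the user's row.
        self._rows[user_id][tweet_id] = tweet_json

    def get_slice(self, user_id, start=None, finish=None, count=100):
        """Return up to count (tweet_id, tweet) pairs from one row, ordered by tweet id."""
        columns = self._rows.get(user_id, {})
        selected = [
            (tweet_id, columns[tweet_id])
            for tweet_id in sorted(columns)
            if (start is None or tweet_id >= start)
            and (finish is None or tweet_id <= finish)
        ]
        return selected[:count]


if __name__ == "__main__":
    store = UserTimeline()
    store.insert(42, 1003, '{"text": "third"}')
    store.insert(42, 1001, '{"text": "first"}')
    store.insert(7, 1002, '{"text": "someone else"}')
    # Only user 42's row is read; no scan over other users' data.
    print(store.get_slice(42))
```

The point of the design is that a single row lookup answers the per-user query, at the cost of rows that grow with each user's tweet count.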
A good (somewhat related) blog post: http://blog.insidesystems.net/basic-time-series-with-cassandra