Datastore design - how to emulate efficient joins
I've got a design question regarding Google's database Cloud Datastore. Let me explain it by using an example:
I've got Entities of the kind "Article" with the following properties:
- title
- userId
- ....
- sumOfScore
sumOfScore should be the sum of all related "Score" entities, which have properties like:
- articleId
- userId
- score
In Pseudo-SQL:
sumOfScore = select sum(score) from Score where score.articleId = article.id
I see two possibilities to design this (using Google's Datastore API):
1.) No sumOfScore property on Article; always query instead:
This means: Every time an article is read, I need to run a query for this specific article to calculate the sumOfScore.
Imagine a list of 100 Articles that is shown to a user. This would need 100 additional queries to the database, just to show the score for each article.
Nevertheless: This would be my preferred way when using a relational database. No redundancy and good normalization.
And with SQL you can use just one join-select to fetch all the data.
But it doesn't feel right for Cloud Datastore.
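To make the cost of option 1 concrete, here is a minimal sketch that uses an in-memory stand-in instead of the real Datastore API. FakeStore, scores_for and sum_of_score are hypothetical names invented for illustration; the only point is to count how many lookups a 100-article list triggers.

```python
from collections import namedtuple

# Hypothetical in-memory model; none of this is the Datastore API.
Score = namedtuple("Score", ["article_id", "user_id", "score"])

class FakeStore:
    """In-memory stand-in for a datastore that counts issued queries."""
    def __init__(self, scores):
        self._scores = list(scores)
        self.query_count = 0

    def scores_for(self, article_id):
        # One "query" per call, like filtering Score by articleId.
        self.query_count += 1
        return [s for s in self._scores if s.article_id == article_id]

def sum_of_score(store, article_id):
    # Option 1: compute the sum on every read.
    return sum(s.score for s in store.scores_for(article_id))

# 100 articles, 3 scores of value 1 each.
store = FakeStore(Score(a, u, 1) for a in range(100) for u in range(3))
totals = {a: sum_of_score(store, a) for a in range(100)}
print(store.query_count)
```

Rendering the list costs one lookup per article, which is the overhead the question is worried about.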
2.) Calculate the sumOfScore whenever Score entities are changed:
This means: Whenever a Score entity is added, removed or changed, the sumOfScore property of the related Article is updated.
Advantage: When reading articles, no additional queries are needed. The sumOfScore is stored redundantly on the entity itself.
Disadvantage: Every time a score is changed, there is one additional query and one additional write (updating the Article entity). And sumOfScore may get out of sync with the actual Score entities (e.g. if a value is changed via the DB console).
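Option 2 could be sketched roughly as follows. This is an in-memory illustration only, not Datastore, JPA or JDO code; Article, ScoreBook, set_score, remove_score and recompute are hypothetical names. It applies the difference between old and new score on each change, and keeps a recompute path for the case where the stored sum has drifted.

```python
class Article:
    def __init__(self, title):
        self.title = title
        self.sum_of_score = 0  # denormalized total, stored on the article

class ScoreBook:
    """Keeps Article.sum_of_score in step with score changes (in-memory sketch)."""
    def __init__(self, article):
        self.article = article
        self.scores = {}  # user_id -> score

    def set_score(self, user_id, value):
        # Covers both "add" and "change": apply only the difference.
        old = self.scores.get(user_id, 0)
        self.scores[user_id] = value
        self.article.sum_of_score += value - old

    def remove_score(self, user_id):
        old = self.scores.pop(user_id, 0)
        self.article.sum_of_score -= old

    def recompute(self):
        # Repair path if the stored sum no longer matches the scores.
        self.article.sum_of_score = sum(self.scores.values())
        return self.article.sum_of_score

article = Article("Hello")
book = ScoreBook(article)
book.set_score("u1", 3)
book.set_score("u2", 5)
book.set_score("u1", 4)   # change an existing score
book.remove_score("u2")
print(article.sum_of_score)
```

Reading the article now needs no extra lookup; the price is the extra write on every score change, plus an occasional recompute if someone edits values behind the application's back.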
What do more experienced people think? Is there a common best practice for such a scenario?
What do the JPA or JDO implementations do under the hood?
Thanks a lot
Mos
Comments (2)
The first thing I recommend is that you look into the GAE article about sharding counters.
That is an article from the GAE best practices relating to how you should be handling counters/sums. It can be a little tricky because every time you update an element you have to use logic to randomly pick a sharded counter; and when you retrieve your count you're actually fetching a group of entities and summing them. I've gone this route but won't provide code here on how I did it because I haven't battle tested it yet. But your code can get sloppy in a hurry if you just copy/paste the sample sharding code all over the place, so make an abstract or typed counter class to reuse your sharding logic if you decide to go this route.
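The idea described above (pick a shard at random on write, sum all shards on read, and wrap that logic in one reusable class) might look something like this. It is an in-memory sketch, not the sample code from the GAE article: ShardedCounter, num_shards and rng are hypothetical names, and a plain list stands in for the separate shard entities.

```python
import random

class ShardedCounter:
    """Reusable sharded counter; each list slot stands in for one shard entity."""
    def __init__(self, num_shards=20, rng=None):
        self._shards = [0] * num_shards
        self._rng = rng or random.Random()

    def add(self, delta):
        # A write touches a single randomly chosen shard,
        # which spreads contention across the shards.
        index = self._rng.randrange(len(self._shards))
        self._shards[index] += delta

    def total(self):
        # A read fetches every shard and sums them.
        return sum(self._shards)

counter = ShardedCounter(num_shards=5)
for value in [3, 5, -2, 4]:
    counter.add(value)
print(counter.total())
```

Keeping the shard selection inside one class is what prevents the copy/paste sprawl the answer warns about.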
Another alternative would be to use a fuzzy count. This method uses memcache and offers better performance at the cost of accuracy.
See the section here labeled "Transient and frequently updated data"
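One possible reading of a fuzzy count is sketched below: increments accumulate in a cache and are only written to durable storage every few updates. This is an assumption about the general approach, not code from the referenced section. FuzzyCounter and flush_every are hypothetical names, and a dict stands in for memcache.

```python
class FuzzyCounter:
    """Buffers increments in a cache and persists them in batches."""
    def __init__(self, flush_every=10):
        self.cache = {}       # stand-in for memcache; entries may vanish
        self.persisted = 0    # stand-in for the value stored on the Article
        self.flush_every = flush_every
        self._pending_ops = 0

    def add(self, delta):
        self.cache["pending"] = self.cache.get("pending", 0) + delta
        self._pending_ops += 1
        if self._pending_ops >= self.flush_every:
            self.flush()

    def flush(self):
        # One durable write covers many cached increments.
        self.persisted += self.cache.pop("pending", 0)
        self._pending_ops = 0

    def value(self):
        return self.persisted + self.cache.get("pending", 0)

fc = FuzzyCounter(flush_every=3)
for delta in [1, 2, 3, 4]:
    fc.add(delta)
print(fc.persisted, fc.value())
```

If the cache entry is evicted before the next flush, the pending increments are lost, which is the accuracy being traded for fewer writes.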
The last alternative is to just use SQL. It's experimental and hot out of the oven (in relation to being used on GAE), but it might be worth looking into.
There's a third possibility that doesn't make a compromise.
You make Score a child of Article, and keep the sumOfScore in Article. For sorting purposes, this field will come in handy. As these two kinds are in the same entity group, you can create a Score and update the Article in a single transaction. You could even double-check by querying all the Scores whose parent is a given Article.
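The sketch below illustrates only the atomicity idea: the Score write and the Article update happen as one unit, and a check compares the stored sum against the children. It is not Datastore code. ArticleGroup, put_score and check are hypothetical names, and a threading.Lock stands in for an entity-group transaction.

```python
import threading

class ArticleGroup:
    """One Article plus its child Scores, updated as a single unit."""
    def __init__(self, title):
        self.title = title
        self.sum_of_score = 0
        self.scores = {}               # user_id -> score (the "children")
        self._lock = threading.Lock()  # stand-in for a transaction

    def put_score(self, user_id, value):
        # Both changes happen together or not at all.
        with self._lock:
            old = self.scores.get(user_id, 0)
            self.scores[user_id] = value
            self.sum_of_score += value - old

    def check(self):
        # The "double check": compare the stored sum with the children.
        with self._lock:
            return self.sum_of_score == sum(self.scores.values())

group = ArticleGroup("Hello")
threads = [
    threading.Thread(target=group.put_score, args=(f"user{i}", i))
    for i in range(10)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(group.sum_of_score, group.check())
```

Because every writer serializes on the same group, this stays consistent but limits write throughput per article.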
The problem with this approach is that you can only update an entity (more precisely, its entity group) about 5 times per second. If you think you'll have much more activity than that (remember, it's a limitation on a single entity, not on the entire table), you should check out the sharded counter tutorial or see the Google I/O video explaining this.
Edit:
Here's a great discussion about this same topic: How does Google Moderator avoid contention?