对多对一关系的聚合数据进行建模的有效方法(例如 stackoverflow 问题上的投票计数)
我很好奇对此进行建模以优化性能的最佳方法是什么...不太关心实时数据完整性
我将继续使用 stackoverflow 示例
Question
id
title
Votes
id
user
question
一个问题有很多票
但是对于许多查询,我们只关心与总票数(例如显示在问题旁边)。
好的关系数据库理论会将两个实体(Q 和 V)创建为单独的关系,需要连接,然后进行求和或计数聚合调用。
另一种可能性是打破正常形式并偶尔将投票的总价值具体化为问题中的属性(例如问题.投票)。 性能是通过读取获得的,但是,取决于您愿意让“投票”数据变得多么陈旧,它需要对该问题记录有更多的权限……反过来又会阻碍性能。
可以使用涉及缓存等的其他技术。 但我只是想知道,就性能而言,最好的解决方案是什么? 假设该网站的流量较高,收到的选票数量远多于问题数量。
也对非关系模型开放。
I'm curious about what be the best way to model this for optimized performance... not as concerned about real time data integrity
I'll continue with the stackoverflow example
Question
id
title
Votes
id
user
question
A question has many votes
For many queries however, we're only concerned with the aggregate number of votes (e.g. to show next to the question).
Good relational db theory would create the two entities (Q and V) as separate relations, requiring a join then a sum or count aggregate call.
Another possibility is to break normal form and occasionally materialize the aggregate value of votes as an attribute in Question (e.g. Question.votes). Performance is gained on reads, however, depending on how stale you are willing to let your "votes" data get, it requires a lot more rights to that Question record... in turn hindering performance.
Other techniques involving caching, etc. can be used. But I'm just wondering, performance wise what's the best solution? Let's say the site is higher traffic and receiving a considerable more amount of votes than questions.
Open to non-relational models as well.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我建议在应用程序的整个生命周期中将投票保留在内存中。
为什么要使用数据库来处理像计数这样简单的事情,因为在某些时候您将加载该项目一次并根据请求询问初始金额是多少。
它还与您如何实现存储库有很大关系,如果您的问题对象延迟加载投票但急切加载投票计数,那么您可以加快该过程,同时不存在将其保留在内存中的问题。 仍然将选票保留在数据库中,只需在应用程序中维护计数即可
I would suggest keeping the vote in memory for the lifetime of the application.
Why hit a db for something as simple as a count, when at some point you will have loaded the item once and asked what the initial amount was on a request basis.
It also has alot to do with how you are implementing repositories, if your question object lazy loads votes but eager loads the count of votes then you can speed up the process while not having an issue about keeping it in memory. Still keep the votes in db, just maintain the count in your application
我在社交网站上到处使用 sql 2005 的索引视图来处理此类事情。 我们的负载绝对是高读/写比率,所以它对我们来说效果很好。
I used indexed views from sql 2005 all over the place for this kind of thing on a social networking site. Our load was definitely a high ratio of reads/writes so it worked well for us.
在这种情况下,联接不太可能太慢,特别是如果您在 Votes 表中的(问题)上有索引。
如果真的太慢,您可以将投票计数缓存在问题表中:
您可以在记录投票时更新投票计数。 例如,从存储过程或直接从应用程序代码。
这些更新很棘手,但既然你不太担心一致性,我想如果投票有时不完全正确也没关系。 要修复任何错误,您可以定期重新生成所有缓存的计数,例如:
如果未找到问题,则聚合 count(v.question) 返回 0,而不是 count(*),后者将返回 1。
如果锁是一个问题,请考虑使用“with (nolock)”或“设置事务隔离级别读未提交”来绕过锁(同样,基于数据完整性的优先级较低。)
作为 nolock 的替代方案,请考虑“读已提交快照”,这适用于数据库读取活动较多,写入活动较少。 您可以通过以下方式打开它:
它适用于 SQL Server 2005 及更高版本。 这就是 Oracle 默认的工作方式,也是 stackoverflow 本身使用的方式。 甚至还有一个关于它的编码恐怖博客条目。
It's unlikely that a join will be too slow in this case, especially if you have an index on (question) in the Votes table.
If it is REALLY too slow, you can cache the vote count in the Question table:
You can update the votecount whenever you record a vote. For example, from a stored procedure or directly from your application code.
Those updates are tricky, but since you're not that worried about consistency, I guess it's ok if the vote is sometimes not exactly right. To fix any errors, you can periodically regenerate all cached counts like:
The aggregate count(v.question) returns 0 if no question was found, as opposed to count(*), which would return 1.
If locks are an issue, consider using "with (nolock)" or "set transaction isolation level read uncommited" to bypass locks (again, based on data integrity being a low priority.)
As an alternative to nolock, consider "read committed snapshot", which is meant for databases with heavy read and less write activity. You can turn it on with:
It is available for SQL Server 2005 and higher. This is how Oracle works by default, and it's what stackoverflow itself uses. There's even a coding horror blog entry about it.