When should you store derived data in a database?
My question is about denormalization. In a database, when should you store derived data in its own column, rather than calculating it every time you need it?
For example, say you have Users who get Upvotes for their Questions. You display a User's reputation on their profile. When a User is Upvoted, should you increment their reputation, or should you calculate it when you retrieve their profile:
SELECT User.id, COUNT(Upvote.id) AS reputation FROM User
LEFT JOIN Question
    ON Question.User_id = User.id
LEFT JOIN Upvote
    ON Upvote.Question_id = Question.id
GROUP BY User.id
How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?
To continue our example, suppose an Upvote has a weight that depends on how many Upvotes (not how much reputation) the User who cast it has. The query to retrieve their reputation suddenly explodes:
SELECT
    User.id AS User_id,
    SUM(UpvoteWeight.weight) AS reputation
FROM User
LEFT JOIN Question
    ON User.id = Question.User_id
LEFT JOIN (
    SELECT
        Upvote.Question_id,
        COUNT(Upvote2.id)+1 AS weight
    FROM Upvote
    LEFT JOIN User
        ON Upvote.User_id = User.id
    LEFT JOIN Question
        ON User.id = Question.User_id
    LEFT JOIN Upvote AS Upvote2
        ON
            Question.id = Upvote2.Question_id
            AND Upvote2.date < Upvote.date
    GROUP BY Upvote.id
) AS UpvoteWeight ON Question.id = UpvoteWeight.Question_id
GROUP BY User.id
This is far out of proportion with the difficulty of an incremental solution. When would normalization be worth it, and when do the benefits of normalization lose to the benefits of denormalization (in this case query difficulty and/or performance)?
There are really two questions here in the guise of one: (1) Will this change improve performance? and (2) Will the performance improvement be worth the effort?
As for whether performance will improve, this is basically a standard pros/cons analysis.
The benefits of normalization are basically two-fold:
Easier data integrity
No issues with re-calculation (e.g. if the underlying data changes, the derived column needs to be re-calculated).
If you cover data integrity with a robustly implemented solution (e.g. a trigger, or stored-procedure-only data changes with direct table-change permissions revoked), then the decision becomes a straightforward cost comparison: checking on each source-data change whether the derived data needs to be recalculated, versus recalculating the derived data on every read. (NOTE: Another approach to keeping data integrity is to force recalculation of the derived data on a schedule, where that data can afford to be inaccurate within some time tolerance. StackExchange takes this approach with some of its numbers.)
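As a rough illustration of the trigger option, here is a minimal sketch using Python's built-in sqlite3 module and the question's User, Question and Upvote tables. The reputation column, the trigger names and the two helper functions are assumptions made for this example, and it only covers the simple one-point-per-upvote rule, not the weighted variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "User" (
    id         INTEGER PRIMARY KEY,
    reputation INTEGER NOT NULL DEFAULT 0   -- assumed derived column
);
CREATE TABLE Question (
    id      INTEGER PRIMARY KEY,
    User_id INTEGER NOT NULL REFERENCES "User"(id)
);
CREATE TABLE Upvote (
    id          INTEGER PRIMARY KEY,
    Question_id INTEGER NOT NULL REFERENCES Question(id),
    User_id     INTEGER NOT NULL REFERENCES "User"(id)
);

-- Keep the stored value in step with every insert and delete on Upvote.
CREATE TRIGGER upvote_added AFTER INSERT ON Upvote
BEGIN
    UPDATE "User" SET reputation = reputation + 1
    WHERE id = (SELECT User_id FROM Question WHERE id = NEW.Question_id);
END;

CREATE TRIGGER upvote_removed AFTER DELETE ON Upvote
BEGIN
    UPDATE "User" SET reputation = reputation - 1
    WHERE id = (SELECT User_id FROM Question WHERE id = OLD.Question_id);
END;
""")


def computed_reputation(conn):
    """Normalized approach: count upvotes at read time."""
    rows = conn.execute("""
        SELECT "User".id, COUNT(Upvote.id)
        FROM "User"
        LEFT JOIN Question ON Question.User_id = "User".id
        LEFT JOIN Upvote ON Upvote.Question_id = Question.id
        GROUP BY "User".id
    """)
    return dict(rows.fetchall())


def cached_reputation(conn):
    """Denormalized approach: read the trigger-maintained column."""
    rows = conn.execute('SELECT id, reputation FROM "User"')
    return dict(rows.fetchall())


conn.executemany('INSERT INTO "User" (id) VALUES (?)', [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO Question (id, User_id) VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2)],
)
conn.executemany(
    "INSERT INTO Upvote (id, Question_id, User_id) VALUES (?, ?, ?)",
    [(1, 10, 2), (2, 10, 3), (3, 11, 2), (4, 12, 1)],
)
after_inserts = cached_reputation(conn)

conn.execute("DELETE FROM Upvote WHERE id = 2")  # a vote is withdrawn
after_delete = cached_reputation(conn)
```

The triggers only stay trustworthy if every write path goes through them, which is why the revoked direct-change permissions matter. A real schema would also need to handle questions being deleted or reassigned.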
In a typical scenario (many more retrievals of data and far fewer changes to the underlying data), the math pretty obviously skews in favor of keeping denormalized derived data in the table.
In some rare cases where the underlying data changes VERY often yet the derived data is not retrieved that often, doing that might be detrimental.
Now we come to the far more important question: Will the performance improvement be worth the effort?
Please note that, as with ALL optimizations, the biggest question is "is the optimization even worth it at all?", and as such it is subject to two main considerations:
Measuring the exact performance difference, and profiling in general.
The context of this specific optimization in the big picture of your system.
E.g. if the difference in query performance between cached derived data and computed data (which, as always when optimizing, must first be measured) is 2%, the extra system complexity of implementing the reputation cache column may not be worth it in the first place. But the threshold between caring and not caring about a marginal improvement depends on the big picture of your app. If you can take steps to improve query performance by 10% somewhere else, concentrate on that rather than on the 2%. If you're Google and a 2% loss in query performance means 2 billion dollars of extra hardware to absorb it, it needs to be optimized anyway.
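To make "measure first" concrete, here is one possible timing harness, again sketched with Python's sqlite3 module. The table sizes, index names, the reputation column and the helpers build_db and best_time are all assumptions for illustration. Timings from an in-memory SQLite database say nothing about a production server, so treat this as the shape of the experiment rather than a result.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE "User" (
    id         INTEGER PRIMARY KEY,
    reputation INTEGER NOT NULL DEFAULT 0   -- assumed cached column
);
CREATE TABLE Question (
    id      INTEGER PRIMARY KEY,
    User_id INTEGER NOT NULL REFERENCES "User"(id)
);
CREATE TABLE Upvote (
    id          INTEGER PRIMARY KEY,
    Question_id INTEGER NOT NULL REFERENCES Question(id),
    User_id     INTEGER NOT NULL REFERENCES "User"(id)
);
CREATE INDEX question_user ON Question(User_id);
CREATE INDEX upvote_question ON Upvote(Question_id);
"""

COMPUTED_SQL = """
SELECT "User".id, COUNT(Upvote.id)
FROM "User"
LEFT JOIN Question ON Question.User_id = "User".id
LEFT JOIN Upvote ON Upvote.Question_id = Question.id
GROUP BY "User".id
ORDER BY "User".id
"""

CACHED_SQL = 'SELECT id, reputation FROM "User" ORDER BY id'


def build_db(n_users=200, questions_per_user=3, upvotes_per_question=4):
    """Create a synthetic data set and fill the cached column once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    users = range(1, n_users + 1)
    conn.executemany('INSERT INTO "User" (id) VALUES (?)', [(u,) for u in users])
    conn.executemany(
        "INSERT INTO Question (User_id) VALUES (?)",
        [(u,) for u in users for _ in range(questions_per_user)],
    )
    n_questions = n_users * questions_per_user  # ids run 1..n_questions
    conn.executemany(
        "INSERT INTO Upvote (Question_id, User_id) VALUES (?, ?)",
        [
            (q, (q + k) % n_users + 1)
            for q in range(1, n_questions + 1)
            for k in range(upvotes_per_question)
        ],
    )
    conn.execute("""
        UPDATE "User" SET reputation = (
            SELECT COUNT(Upvote.id)
            FROM Question
            JOIN Upvote ON Upvote.Question_id = Question.id
            WHERE Question.User_id = "User".id
        )
    """)
    return conn


def best_time(conn, sql, repeats=20):
    """Return the fastest of several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best


conn = build_db()
computed = conn.execute(COMPUTED_SQL).fetchall()
cached = conn.execute(CACHED_SQL).fetchall()
t_computed = best_time(conn, COMPUTED_SQL)
t_cached = best_time(conn, CACHED_SQL)
print(f"computed: {t_computed * 1e3:.3f} ms, cached: {t_cached * 1e3:.3f} ms")
```

Checking that both queries return the same rows before comparing their timings guards against benchmarking a cache that has already drifted.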
There is really no clear-cut answer because it depends on a lot of factors, like the volume of the site and how often you display the reputation (i.e. only on their profile page or next to EVERY instance of their user name, everywhere). The only real answer is "when it gets too slow"; in other words, you would probably need to test both scenarios and get some real-world performance stats.
Personally I'd denormalize in this particular situation and have either an insert trigger on the upvote table or a periodic update query that updates the denormalized reputation column. Would it really be the end of the world if someone's rep said "204" instead of "205" until the page refreshes?
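The periodic-update variant might look something like the sketch below, written with Python's sqlite3 module. The reputation column and the functions refresh_reputation and shown_reputation are hypothetical names; in practice the refresh would be run by whatever scheduler the site already has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "User" (
    id         INTEGER PRIMARY KEY,
    reputation INTEGER NOT NULL DEFAULT 0   -- assumed denormalized column
);
CREATE TABLE Question (
    id      INTEGER PRIMARY KEY,
    User_id INTEGER NOT NULL REFERENCES "User"(id)
);
CREATE TABLE Upvote (
    id          INTEGER PRIMARY KEY,
    Question_id INTEGER NOT NULL REFERENCES Question(id),
    User_id     INTEGER NOT NULL REFERENCES "User"(id)
);
""")

REFRESH_SQL = """
UPDATE "User" SET reputation = (
    SELECT COUNT(Upvote.id)
    FROM Question
    JOIN Upvote ON Upvote.Question_id = Question.id
    WHERE Question.User_id = "User".id
)
"""


def refresh_reputation(conn):
    """Recompute every user's reputation from the source tables."""
    conn.execute(REFRESH_SQL)
    conn.commit()


def shown_reputation(conn, user_id):
    """What a profile page would display: the stored value."""
    row = conn.execute(
        'SELECT reputation FROM "User" WHERE id = ?', (user_id,)
    ).fetchone()
    return row[0]


conn.executemany('INSERT INTO "User" (id) VALUES (?)', [(1,), (2,), (3,), (4,)])
conn.execute("INSERT INTO Question (id, User_id) VALUES (10, 1)")
conn.executemany(
    "INSERT INTO Upvote (Question_id, User_id) VALUES (?, ?)",
    [(10, 2), (10, 3)],
)
refresh_reputation(conn)
before = shown_reputation(conn, 1)

# A new vote arrives between two scheduled refreshes.
conn.execute("INSERT INTO Upvote (Question_id, User_id) VALUES (10, 4)")
stale = shown_reputation(conn, 1)   # still the old value

refresh_reputation(conn)
fresh = shown_reputation(conn, 1)   # caught up after the refresh
```

The stored value lags by at most one refresh interval, which is the "204 instead of 205" trade-off. Recomputing from the source tables also means any drift corrects itself on the next run.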
I just wanted to throw in another angle on the data integrity concern that DVK covered so well in the response above. Think about whether other systems may need to access/calculate the derived data -- even something as simple as a reporting system. If other systems need to use the derived value or update the upvote value then you may have additional considerations around how to reuse the calculation code or how to ensure that the derived value is consistently updated no matter what system changes the upvote.
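One possible way to share the calculation across systems is to define it once in the database, for example as a view, so the application and a reporting tool both read the same definition. The sketch below uses Python's sqlite3 module; the view name UserReputation and the two reader functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "User" (id INTEGER PRIMARY KEY);
CREATE TABLE Question (
    id      INTEGER PRIMARY KEY,
    User_id INTEGER NOT NULL REFERENCES "User"(id)
);
CREATE TABLE Upvote (
    id          INTEGER PRIMARY KEY,
    Question_id INTEGER NOT NULL REFERENCES Question(id),
    User_id     INTEGER NOT NULL REFERENCES "User"(id)
);

-- Single shared definition of "reputation" (assumed name).
CREATE VIEW UserReputation AS
SELECT "User".id AS User_id, COUNT(Upvote.id) AS reputation
FROM "User"
LEFT JOIN Question ON Question.User_id = "User".id
LEFT JOIN Upvote ON Upvote.Question_id = Question.id
GROUP BY "User".id;
""")


def profile_page_reputation(conn, user_id):
    """The web application reads one user's value."""
    row = conn.execute(
        "SELECT reputation FROM UserReputation WHERE User_id = ?", (user_id,)
    ).fetchone()
    return row[0]


def report_top_users(conn, limit):
    """A reporting system reads the same definition in bulk."""
    return conn.execute(
        "SELECT User_id, reputation FROM UserReputation "
        "ORDER BY reputation DESC, User_id LIMIT ?",
        (limit,),
    ).fetchall()


conn.executemany('INSERT INTO "User" (id) VALUES (?)', [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO Question (id, User_id) VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2)],
)
conn.executemany(
    "INSERT INTO Upvote (Question_id, User_id) VALUES (?, ?)",
    [(10, 2), (10, 3), (11, 2), (12, 1)],
)
profile_value = profile_page_reputation(conn, 1)
top_two = report_top_users(conn, 2)
```

If the value is later denormalized into a column, the same view can double as the reference that a trigger or scheduled job is checked against.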