In a database, when should you store derived data?

Posted 2024-10-01 13:37:53

My question is about denormalization. In a database, when should you store derived data in its own column, rather than calculating it every time you need it?

For example, say you have Users who get Upvotes for their Questions. You display a User's reputation on their profile. When a User is Upvoted, should you increment their reputation, or should you calculate it when you retrieve their profile:

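-- COUNT(Upvote.id) rather than COUNT(*): with LEFT JOINs, COUNT(*) would
-- report 1 for a user who has no upvotes at all.
SELECT User.id, COUNT(Upvote.id) AS reputation FROM User
LEFT JOIN Question
  ON Question.User_id = User.id
LEFT JOIN Upvote
  ON Upvote.Question_id = Question.id
GROUP BY User.id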

How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?

To continue our example, suppose an Upvote has a weight that depends on how many Upvotes (not how much reputation) the User who cast it has. The query to retrieve their reputation suddenly explodes:

SELECT
  User.id AS User_id,
  -- COALESCE so that users with no upvotes get 0 rather than NULL
  COALESCE(SUM(UpvoteWeight.weight), 0) AS reputation
FROM User
LEFT JOIN Question
  ON User.id = Question.User_id
LEFT JOIN (
  -- One row per upvote: its weight is 1 plus the number of upvotes the
  -- voter had already received on their own questions at that time.
  SELECT
    Upvote.Question_id,
    COUNT(Upvote2.id)+1 AS weight
  FROM Upvote
  LEFT JOIN User
    ON Upvote.User_id = User.id
  LEFT JOIN Question
    ON User.id = Question.User_id
  LEFT JOIN Upvote AS Upvote2
    ON
      Question.id = Upvote2.Question_id
      AND Upvote2.date < Upvote.date
  GROUP BY Upvote.id
) AS UpvoteWeight ON Question.id = UpvoteWeight.Question_id
GROUP BY User.id

This is far out of proportion with the difficulty of an incremental solution. When would normalization be worth it, and when do the benefits of normalization lose to the benefits of denormalization (in this case query difficulty and/or performance)?

你的呼吸 2024-10-08 13:37:53

How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?

There really are two questions here in the guise of one: (1) Will this change improve performance? and (2) Will the performance improvement be worth the effort?

As for whether performance will improve, this is basically a standard pros/cons analysis.

The benefits of normalization are basically two-fold:

Easier data integrity
No issues with re-calculation (e.g. if the underlying data changes, the derived column needs to be re-calculated).

If you cover data integrity with a robustly implemented solution (e.g. a trigger, or stored-procedure-only data changes with direct table-change permissions revoked, etc.), then this becomes a straightforward comparison of two costs: checking, on each source data change, whether the derived data needs to be recalculated, vs. recalculating the derived data every time it is read. (NOTE: Another approach to keeping data integrity is to force the recalculation of derived data on a schedule, where that data can afford to be inaccurate within some time tolerance. StackExchange takes this approach with some of its numbers.)
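As a rough illustration of the trigger option and the scheduled-recalculation option mentioned above, here is a minimal sketch using Python's standard `sqlite3` module. The table names follow the question; the `reputation` column, the trigger names and the helper functions are assumptions made for the example, and it only covers the simple unweighted case (one point per upvote).

```python
import sqlite3

SCHEMA = """
CREATE TABLE User (id INTEGER PRIMARY KEY,
                   reputation INTEGER NOT NULL DEFAULT 0);  -- derived column (assumed)
CREATE TABLE Question (id INTEGER PRIMARY KEY,
                       User_id INTEGER NOT NULL REFERENCES User(id));
CREATE TABLE Upvote (id INTEGER PRIMARY KEY,
                     Question_id INTEGER NOT NULL REFERENCES Question(id),
                     User_id INTEGER NOT NULL REFERENCES User(id));

-- Keep User.reputation in step with every change to Upvote.
CREATE TRIGGER upvote_added AFTER INSERT ON Upvote
BEGIN
    UPDATE User SET reputation = reputation + 1
    WHERE id = (SELECT User_id FROM Question WHERE id = NEW.Question_id);
END;

CREATE TRIGGER upvote_removed AFTER DELETE ON Upvote
BEGIN
    UPDATE User SET reputation = reputation - 1
    WHERE id = (SELECT User_id FROM Question WHERE id = OLD.Question_id);
END;
"""

# Full recalculation from the source tables; could be run on a schedule
# or used as a consistency check against the incrementally kept value.
RECOMPUTE = """
UPDATE User SET reputation = (
    SELECT COUNT(Upvote.id)
    FROM Question
    LEFT JOIN Upvote ON Upvote.Question_id = Question.id
    WHERE Question.User_id = User.id
);
"""


def reputation(conn, user_id):
    """Read the stored (denormalized) reputation of one user."""
    row = conn.execute("SELECT reputation FROM User WHERE id = ?", (user_id,))
    return row.fetchone()[0]


def recompute_all(conn):
    """Overwrite every stored reputation with a freshly computed one."""
    conn.execute(RECOMPUTE)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO User (id) VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO Question (id, User_id) VALUES (10, 1)")
# User 2 upvotes user 1's question twice, then one upvote is withdrawn.
conn.executemany("INSERT INTO Upvote (Question_id, User_id) VALUES (?, ?)",
                 [(10, 2), (10, 2)])
conn.execute("DELETE FROM Upvote WHERE id = 1")
print(reputation(conn, 1))
```

The triggers only guard changes made through `Upvote`; if a question could be reassigned to another user, that path would need its own trigger, which is the kind of extra integrity work the answer is weighing.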

In a typical scenario (many more reads of the data and far fewer changes to the underlying data), the math pretty obviously skews in favor of keeping de-normalized derived data in the table.

In some rare cases where the underlying data changes VERY often yet the derived data is not retrieved that often, doing that might be detrimental.
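The read-heavy and write-heavy cases can be put side by side with a back-of-the-envelope cost model. This is only a sketch: the function name and all of the per-operation costs below are made-up figures (in microseconds) chosen for illustration, not measurements.

```python
def total_costs(reads, writes, compute_cost, lookup_cost, update_cost):
    """Return (on_demand, stored) total cost for a workload.

    on_demand: every read recomputes the derived value.
    stored:    every read is a cheap lookup, but every write to the
               underlying data also pays to update the derived column.
    """
    on_demand = reads * compute_cost
    stored = reads * lookup_cost + writes * update_cost
    return on_demand, stored


# Hypothetical per-operation costs in microseconds.
COMPUTE, LOOKUP, UPDATE = 5000, 100, 1000

# Typical case: many reads, few changes to the underlying data.
typical = total_costs(reads=10_000, writes=100,
                      compute_cost=COMPUTE, lookup_cost=LOOKUP, update_cost=UPDATE)

# Rare case: the data changes very often but is seldom read.
write_heavy = total_costs(reads=10, writes=100_000,
                          compute_cost=COMPUTE, lookup_cost=LOOKUP, update_cost=UPDATE)

for name, (on_demand, stored) in [("typical", typical), ("write-heavy", write_heavy)]:
    winner = "stored column" if stored < on_demand else "compute on demand"
    print(f"{name}: on_demand={on_demand} stored={stored} -> {winner}")
```

With these assumed figures the stored column wins the typical workload and loses the write-heavy one; real numbers have to come from measuring the actual queries.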

Now, we are onto the far more important question: Will the performance improvement be worth the effort?

Please note that, as with ALL optimizations, the biggest question is "is the optimization even worth it at all?", and as such it is subject to two main considerations:

Measuring exact performance difference and generally profiling.
Context of this specific optimization in the big picture of your system.
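For the measuring step, one possible starting point is a small timing harness like the sketch below. It assumes an in-memory SQLite database filled with synthetic rows; the row counts, index names and helper functions are all hypothetical, and timings on a real database server with real data will look different, so no numbers are shown.

```python
import sqlite3
import time

# Reputation computed from the source tables on every read.
COMPUTED = """
SELECT COUNT(Upvote.id)
FROM Question
JOIN Upvote ON Upvote.Question_id = Question.id
WHERE Question.User_id = ?
"""
# Reputation read from the stored (denormalized) column.
STORED = "SELECT reputation FROM User WHERE id = ?"


def build(conn, users=50, questions_per_user=5, upvotes_per_question=20):
    """Create the question's tables and fill them with synthetic rows."""
    conn.executescript("""
        CREATE TABLE User (id INTEGER PRIMARY KEY,
                           reputation INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE Question (id INTEGER PRIMARY KEY, User_id INTEGER NOT NULL);
        CREATE TABLE Upvote (id INTEGER PRIMARY KEY, Question_id INTEGER NOT NULL);
        CREATE INDEX question_user ON Question (User_id);
        CREATE INDEX upvote_question ON Upvote (Question_id);
    """)
    question_id = 0
    for user_id in range(1, users + 1):
        conn.execute("INSERT INTO User (id) VALUES (?)", (user_id,))
        for _ in range(questions_per_user):
            question_id += 1
            conn.execute("INSERT INTO Question (id, User_id) VALUES (?, ?)",
                         (question_id, user_id))
            conn.executemany("INSERT INTO Upvote (Question_id) VALUES (?)",
                             [(question_id,)] * upvotes_per_question)
    # Fill the stored column once so both queries can be compared.
    conn.execute("""
        UPDATE User SET reputation = (
            SELECT COUNT(Upvote.id)
            FROM Question
            JOIN Upvote ON Upvote.Question_id = Question.id
            WHERE Question.User_id = User.id)
    """)


def time_query(conn, sql, user_id, repeats=200):
    """Return (value, average seconds per execution) for one query."""
    start = time.perf_counter()
    for _ in range(repeats):
        value = conn.execute(sql, (user_id,)).fetchone()[0]
    return value, (time.perf_counter() - start) / repeats


conn = sqlite3.connect(":memory:")
build(conn)
computed_value, computed_secs = time_query(conn, COMPUTED, user_id=1)
stored_value, stored_secs = time_query(conn, STORED, user_id=1)
print(f"computed: {computed_value} in {computed_secs:.6f}s per query")
print(f"stored:   {stored_value} in {stored_secs:.6f}s per query")
```

The point of the harness is only that both variants are timed under the same conditions and return the same value; whether the gap matters is the big-picture question raised in the second consideration.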

E.g. if the difference in query performance between the cached derived data and the computed value (which, as always when optimizing, must first be measured) is 2%, the extra system complexity of implementing the reputation cache column may not be worth it in the first place. But where the threshold between caring and not caring about a marginal improvement lies depends on the big picture of your app. If you can take steps to improve query performance by 10% somewhere else, concentrate on that rather than on the 2%. If you're Google and that 2% of query performance translates into 2 billion dollars of extra hardware, it needs to be optimized anyway.
