Which database design is better?
Given a site like StackOverflow, would it be better to create a num_comments column that stores how many comments a submission has and update it whenever a comment is made, or to just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant, but the former would be more efficient. What does SO think?
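To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and every name other than num_comments and COUNT (submissions, comments, submission_id, the two helper functions) are assumptions made for illustration, not something the question specifies.

```python
import sqlite3

# Hypothetical schema: only num_comments comes from the question itself.
SCHEMA = """
CREATE TABLE submissions (
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    num_comments INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE comments (
    id            INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL REFERENCES submissions(id),
    body          TEXT NOT NULL
);
"""


def comment_count_via_count(conn, submission_id):
    """Normalized option: count the comment rows at read time."""
    row = conn.execute(
        "SELECT COUNT(*) FROM comments WHERE submission_id = ?",
        (submission_id,),
    ).fetchone()
    return row[0]


def comment_count_via_column(conn, submission_id):
    """Denormalized option: read the stored counter."""
    row = conn.execute(
        "SELECT num_comments FROM submissions WHERE id = ?",
        (submission_id,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Seed one submission whose stored counter already matches its two comments.
conn.execute(
    "INSERT INTO submissions (id, title, num_comments) VALUES (1, 'example', 2)"
)
conn.executemany(
    "INSERT INTO comments (submission_id, body) VALUES (?, ?)",
    [(1, "first"), (1, "second")],
)
conn.commit()

print(comment_count_via_count(conn, 1), comment_count_via_column(conn, 1))  # prints "2 2"
```

Both reads return the same number only as long as every write keeps the counter in sync, which is the trade-off the answers discuss.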
Comments (5)
Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but also a write lock on the row containing the comment count.
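The extra write described above might look like the following sketch, again with Python's sqlite3 module. The schema and the add_comment helper are hypothetical; the point is only that the stored-counter design turns one statement into two that must succeed or fail together.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE submissions (
            id           INTEGER PRIMARY KEY,
            title        TEXT NOT NULL,
            num_comments INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE comments (
            id            INTEGER PRIMARY KEY,
            submission_id INTEGER NOT NULL REFERENCES submissions(id),
            body          TEXT NOT NULL
        );
    """)
    conn.execute("INSERT INTO submissions (id, title) VALUES (1, 'example')")
    conn.commit()
    return conn


def add_comment(conn, submission_id, body):
    """Insert a comment and bump the stored counter in one transaction.

    Using the connection as a context manager commits if both statements
    succeed and rolls back if either raises, so the counter cannot drift
    because of a half-applied write.
    """
    with conn:
        conn.execute(
            "INSERT INTO comments (submission_id, body) VALUES (?, ?)",
            (submission_id, body),
        )
        # This UPDATE is the additional write on the submission row.
        conn.execute(
            "UPDATE submissions SET num_comments = num_comments + 1 WHERE id = ?",
            (submission_id,),
        )
```

With the plain COUNT design, add_comment would contain only the INSERT, and concurrent commenters on the same submission would not all have to update the same row.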
The former is not normalized but will perform better (assuming many more reads than writes).
The latter is more normalized, but reading the count will require more resources and hence be less performant.
Which is better boils down to application requirements.
I would suggest counting comment records. Although the other method would be faster, counting makes for a cleaner database. Adding a count column would be a form of data duplication, not to mention that it requires an additional code step and an extra write for every comment.
If you expect millions of comments, then you may want to pick the count column approach.
I agree with @Oded. It depends on the app requirements and on how active the site is; however, here are my two cents as well.
The "correct" way to design this is to use another table, join it, and COUNT. This is consistent with what database normalization teaches. The problem with normalization is that it does not scale. There are only so many ways to skin a cat, so if you have millions of queries per day and a lot of them involve table X, database performance will suffer badly, because the server also has to deal with concurrent writes, transactions, etc.
To deal with this problem, a common practice is sharding. Sharding has the side effect that the rows of a table are no longer stored in the same physical location, and a primary consequence of this is that you cannot JOIN anymore; how can you JOIN against half a table and receive meaningful results? And obviously, trying to JOIN against all partitions of a table and merge the results would be a cure worse than the disease. So you see that not only is the alternative you are examining used in practice to achieve high performance, but there are even more radical steps that engineers can and do take.
Of course, unless you actually have performance issues, sharding or even de-normalizing just makes your life harder for no tangible benefit.
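As a rough illustration of why counting gets awkward once a table is sharded, the sketch below spreads comments over several independent in-memory SQLite databases. The shard count, the routing rule (comment id modulo the number of shards), and all function names are assumptions chosen for the example, not a description of how any particular site shards its data.

```python
import sqlite3

NUM_SHARDS = 3  # hypothetical shard count


def make_shards(n=NUM_SHARDS):
    """Create n independent databases, each holding part of the comments table."""
    shards = []
    for _ in range(n):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE comments ("
            " id INTEGER PRIMARY KEY,"
            " submission_id INTEGER NOT NULL,"
            " body TEXT NOT NULL)"
        )
        shards.append(conn)
    return shards


def add_comment(shards, comment_id, submission_id, body):
    """Route a comment to a shard by comment id (an assumed routing rule)."""
    shard = shards[comment_id % len(shards)]
    with shard:
        shard.execute(
            "INSERT INTO comments (id, submission_id, body) VALUES (?, ?, ?)",
            (comment_id, submission_id, body),
        )


def count_comments(shards, submission_id):
    """Ask every shard for its partial count and merge the results by summing."""
    total = 0
    for shard in shards:
        (partial,) = shard.execute(
            "SELECT COUNT(*) FROM comments WHERE submission_id = ?",
            (submission_id,),
        ).fetchone()
        total += partial
    return total
```

Every read now costs one query per shard, and the merging happens in application code. That is the kind of overhead that makes a stored counter attractive at this scale, and unnecessary below it.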