是否不鼓励将同一条数据保存在关系数据库的两个地方？

发布于 2024-10-04 21:21:54 字数 139 浏览 3 评论 0原文

我正在编写一个基于 Django/Python 的应用程序，我想避免编写代码或向数据库发出不必要的请求。将已存储在较高级别的数据存储在较低级别以便更容易从较低级别访问是否不合理？

希望这很清楚 - 如果您需要澄清，请告诉我。

谢谢！

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

赠佳期 2024-10-11 21:21:54

如果所需的计算极其冗长或复杂，那么对数据库进行部分非规范化可能会很有用，但只有在其他优化尝试几乎没有成功或没有成功之后才应该这样做，而且只能谨慎进行。

回复收藏 0 原文

对你的占有欲 2024-10-11 21:21:54

除了非规范化之外，还有其他方法可以提高查询性能。当然，第一步是实际测量代码的性能。仅仅因为连接查询不太容易扩展并不意味着使用它们是一个坏主意。并非每个网站的请求负载都需要超优化的请求处理。

事实上，Cal Henderson 计算出，除了 100 个网站之外，所有网站都不在访问量最大的 100 个网站。

如果事实证明您经过深思熟虑且非常易于维护的查询实际上是瓶颈（即使您不一定位于前 100 名，也许您位于前 1,000,000 名），则第一个步骤是仔细查看成为瓶颈的实际查询。有时，在正确的位置添加或删除索引，或者只是保持数据库整洁（VACUUM 或 ANALYZE 命令），就可以产生不同的效果。

下一步是寻找缓存机会。很大比例的典型站点对每个页面和用户一遍又一遍地执行相同的查询，并且它们的变化不会很快。将经常使用的数据存储在缓存（例如内存缓存）中将减轻数据库服务器的一些负担。

当我们讨论数据库服务器时，请确保托管数据库的计算机可以自由地响应查询，而不是执行任何其他工作。数据库是最难扩展的技术之一，因此为了充分利用它，请确保您的数据库运行在具有大量内存、快速磁盘且没有其他责任的计算机上。将Web应用程序服务器放在自己的机器上，将静态文件放在另一台机器上，让防火墙驻留在单独的机器上。查询缓存可以运行在它自己的机器上。

仅仅因为数据库很难扩展（尽管很容易扩展），但这并非不可能。如果两个表很少在同一查询中使用，则可以将它们拆分到两个设备上。如果某组表的连接条件始终相同（例如 UserID），则可以将行拆分到多个服务器上（这称为分片）。

一旦您真正用尽了所有这些选项，下一步可能就是非规范化。这看起来很简单，但有几个原因导致您采取的步骤如此之晚。

最大的问题是因为实际上很难保持重复数据的正确性。如果没有数据重复，则不会发生冲突，但如果数据确实重复，并且应用程序的一部分没有仔细更新这两个位置，则不同的查询将返回不同的冲突结果。对数据进行非规范化时的良好选择可以使不同步情况变得轻微（只需从权威来源重建非规范化数据），但它总是会增加您必须执行的代码维护工作。

第二个是因为非规范化不一定会加速。就像添加索引一样，它以写入性能换取读取性能。如果您确实进行了大量写入，则实际上会使平均性能变慢。与添加索引不同，数据库无法告诉您何时需要非规范化；这始终是一个猜测。

最后，当您决定采用非规范化时，请注意您的 ORM（如果您使用 ORM，因为这是您可能会做的 django 问题）真正在做什么。确保您设计的查询只提供您实际使用的数据，并确保您的 ORM 不会帮助提取更多数据。

Besides denormalization, there are other ways to improve query performance. The first step, of course, is to actually measure your code's performance. Just because a join query doesn't scale up very easily doesn't mean it's a bad idea to use them. Not every website has the request-load to require ultra-optimized request handling.

In fact, Cal Henderson has calculated that all but 100 websites are not in the top 100 most visited websites.

If it turns out your well thought out and very mantainable queries are actually a bottleneck (even if you're not necessarily in the top 100, maybe you are in the top 1,000,000), the first step is to take a really good look at the actual queries that are bottlenecks. Sometimes adding or removing indexes in the right place, or just keeping the database tidy (VACUUM or ANALYZE commands) on a scehdule can make the difference.

A next step is to look for caching opportunities. A significant percentage of typical sites perform the same queries over and over for every page and user, and they don't change very quickly. Stashing oft-used data in a cache (memcache, for instance) will take some burden off of the database server.

While we're on the subject of database servers, make sure that the machine hosting the database is free to respond to queries, instead of performing any other work. Databases are among the hardest technology to scale out, so to get the most out of it, make sure your database is running on a machine with loads of ram, fast disks and no other responsibilities. Put the web-app server on its own machine, put the static files on another machine, let the firewall live on a separate machine. The query cache can go on its own machine.

Just because databases are hard to scale out (though easy to scale up), it is not impossible. If two tables are rarely used in the same query, they can be split across two devices. If the join conditions of a certain set of tables are always the same (like UserID's), then rows can be split across multiple servers (this is called sharding).

Once you've really exhausted all of these options, denormalization is probably the next step. It seems simple, but there's a few reasons why this is so late in the steps you want to take.

The biggest is because it's actually hard to keep the duplicated data right. If no data is duplicated, then there can't be a conflict, but if data does get duplicated and one portion of the app doesn't carefully update both places, then different queries will return different, conflicting results. good choices when denromalizing your data can make out of sync conditions minor (just rebuild the denormalized data from the authoritative source), but it always increases the code maintenance work you will have to do.

The second is because denormalization isn't neccesarily a speed-up. Much like adding indexes, it trades write performance for read performance. If you actually do a decent number of writes, it will actually make the average performance slower. Unlike adding indexes, databases can't tell you when you need denormalization; It is always a guess.

Finally, when you do decide to go with denormalization, be aware of what your ORM (if you use one, since this is a django question you probably do) is really doing. Make sure you are designing your queries to give you only the data you actually use, and make sure your ORM isn't helpfully pulling more than that.

回复收藏 0 原文