How should I keep accurate records summarising multiple tables?
I have a normalized database and need to produce web-based reports frequently that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results computed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.
All tables have autoincrement primary integer keys, I almost always add new rows, and I can arrange to clear the computed results when they change.
I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table, keeping track of the iterator state and the highest primary key (i.e. the "high-water mark") seen. That's fine for a single table, but for multiple tables I'd end up keeping one high-water value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably grow my database from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
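The single-table high-water approach described above can be sketched like this; a minimal illustration using a hypothetical `events` table and a one-row `summary` table (the names are mine, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, amount INTEGER);
    CREATE TABLE summary (total INTEGER NOT NULL, highwater INTEGER NOT NULL);
    INSERT INTO summary VALUES (0, 0);
""")

def refresh_summary(conn):
    """Fold any rows above the stored high-water mark into the running total."""
    total, highwater = conn.execute("SELECT total, highwater FROM summary").fetchone()
    delta, new_high = conn.execute(
        "SELECT COALESCE(SUM(amount), 0), COALESCE(MAX(id), ?) "
        "FROM events WHERE id > ?", (highwater, highwater)).fetchone()
    conn.execute("UPDATE summary SET total = ?, highwater = ?",
                 (total + delta, new_high))
    conn.commit()

conn.executemany("INSERT INTO events (amount) VALUES (?)", [(5,), (7,)])
refresh_summary(conn)
print(conn.execute("SELECT total FROM summary").fetchone()[0])  # 12
```

The per-table version the question worries about is this same pattern repeated, with one `highwater` column per summarized table.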
5 Answers
I see two approaches:
You move the data into a separate, denormalized database with some precalculation, optimized for quick access and reporting (it sounds like a small data warehouse). This implies you have to think about jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on how you want the copying done (full/incremental), the frequency of copying, and the complexity of the data model (both source and destination), it might take a while to implement and then optimize the process. It has the advantage of leaving your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changes in the application logic (but you might find a way to minimize the impact on the logic using the database; you know the situation better than me :) ).
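For the first approach, an incremental copy into an attached reporting database might look like this sketch (the `orders`/`customers` schema and the `rpt` database are hypothetical, chosen only to illustrate the idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         customer_id INTEGER, total INTEGER);
    -- Separate reporting database holding a denormalized, flat copy.
    ATTACH ':memory:' AS rpt;
    CREATE TABLE rpt.orders_flat (order_id INTEGER PRIMARY KEY,
                                  customer_name TEXT, total INTEGER);
""")

def copy_increment(conn):
    """Copy only rows the reporting database has not seen yet."""
    conn.execute("""
        INSERT INTO rpt.orders_flat (order_id, customer_name, total)
        SELECT o.id, c.name, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id
        WHERE o.id > (SELECT COALESCE(MAX(order_id), 0) FROM rpt.orders_flat)
    """)
    conn.commit()

conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(1, 10), (1, 20)])
copy_increment(conn)
print(conn.execute("SELECT COUNT(*) FROM rpt.orders_flat").fetchone()[0])  # 2
```

Reports then read only `rpt.orders_flat`, so the join cost is paid once per copy job rather than per page load.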
Can the reports be refreshed incrementally, or is it a full recalculation to rework the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and a metadata table to define which report output versions are available), but most of the time this is overkill and you are better off just saving the query results off to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
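The cache-until-next-refresh idea for the full-recalculation case could be sketched as follows (the `sales` table, the report name, and the in-process dict cache are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY AUTOINCREMENT, amount INTEGER)")
conn.executemany("INSERT INTO sales (amount) VALUES (?)", [(3,), (4,)])

_cache = {}  # report name -> (high-water snapshot, cached result rows)

def cached_report(conn, name, query):
    """Serve the cached result set until new rows invalidate it."""
    highwater = conn.execute("SELECT COALESCE(MAX(id), 0) FROM sales").fetchone()[0]
    snap, rows = _cache.get(name, (None, None))
    if snap != highwater:                     # stale: recalculate in full
        rows = conn.execute(query).fetchall()
        _cache[name] = (highwater, rows)
    return rows

rows = cached_report(conn, "totals", "SELECT SUM(amount) FROM sales")
print(rows[0][0])  # 7
```

The same dict could equally live in a file or external cache store, as the answer suggests; the point is that the expensive query runs only when the high-water snapshot has moved.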
You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Example:
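The example that originally followed this answer appears to have been lost; a minimal sketch of one possibility, a trigger that invalidates a hypothetical `report_cache` table whenever the summarized table changes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, total INTEGER);
    CREATE TABLE report_cache (name TEXT PRIMARY KEY, value INTEGER);

    -- Invalidate the cached report whenever the summarized table changes.
    CREATE TRIGGER orders_dirty AFTER INSERT ON orders
    BEGIN
        DELETE FROM report_cache WHERE name = 'order_totals';
    END;
""")
conn.execute("INSERT INTO report_cache VALUES ('order_totals', 0)")
conn.execute("INSERT INTO orders (total) VALUES (9)")  # fires the trigger
print(conn.execute("SELECT COUNT(*) FROM report_cache").fetchone()[0])  # 0
```

The other option hinted at, updating the cached value inside the trigger body instead of deleting it, keeps the summary always current at the cost of extra work on every write.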
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
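A sketch of that arrangement, assuming a single hypothetical writer class that owns both the connection and the in-memory summary:

```python
import sqlite3

class SingleWriter:
    """Sole process making database updates; the summary lives in its heap."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, amount INTEGER)")
        self.total = 0  # the summary: held in memory, not in the database

    def add_event(self, amount):
        # Update the row and the in-heap summary in the same place,
        # so they can never drift apart.
        self.conn.execute("INSERT INTO events (amount) VALUES (?)", (amount,))
        self.total += amount

w = SingleWriter()
w.add_event(2)
w.add_event(3)
print(w.total)  # 5
```

As the answer notes, this only stays consistent because every write goes through the one instance; a second writer would update the table without updating the heap copy.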
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).
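Whether a SQLite query is actually served from a covering index can be checked with `EXPLAIN QUERY PLAN`; a small sketch (the `orders` schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)")
# The index contains every column the query touches, so SQLite can
# answer from the index alone without visiting the table rows.
conn.execute("CREATE INDEX idx_cust_total ON orders (customer_id, total)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(total) FROM orders WHERE customer_id = ?",
    (1,)).fetchall()
print(plan[0][3])  # the detail column mentions a COVERING INDEX
```

If the plan shows a plain table search instead, widening the index to include the remaining queried columns is the usual fix.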