最有效的查询是什么?
我有一个名为 Projects 的表,它具有以下关系:
有许多贡献 有许多付款
在我的结果集中,我需要以下聚合值:
- 唯一贡献者数量(贡献表中的 DonorID)
- 贡献总额(贡献表中的金额总和)
- 支付总额(付款表中的 PaymentAmount 总和)
因为有许多聚合函数和多个联接,使用标准聚合函数和 GROUP BY 子句会变得混乱。我还需要能够对这些字段进行排序和过滤。所以我提出了两个选项:
使用子查询:
SELECT Project.ID AS PROJECT_ID,
(SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = PROJECT_ID) AS TotalPaidBack,
(SELECT COUNT(DISTINCT DonorID) FROM Contribution WHERE RecipientID = PROJECT_ID) AS ContributorCount,
(SELECT SUM(Amount) FROM Contribution WHERE RecipientID = PROJECT_ID) AS TotalReceived
FROM Project;
使用临时表:
DROP TABLE IF EXISTS Project_Temp;
CREATE TEMPORARY TABLE Project_Temp (project_id INT NOT NULL, total_payments INT, total_donors INT, total_received INT, PRIMARY KEY(project_id)) ENGINE=MEMORY;
INSERT INTO Project_Temp (project_id,total_payments)
SELECT `Project`.ID, IFNULL(SUM(PaymentAmount),0) FROM `Project` LEFT JOIN `Payment` ON ProjectID = `Project`.ID GROUP BY 1;
INSERT INTO Project_Temp (project_id,total_donors,total_received)
SELECT `Project`.ID, IFNULL(COUNT(DISTINCT DonorID),0), IFNULL(SUM(Amount),0) FROM `Project` LEFT JOIN `Contribution` ON RecipientID = `Project`.ID GROUP BY 1
ON DUPLICATE KEY UPDATE total_donors = VALUES(total_donors), total_received = VALUES(total_received);
SELECT * FROM Project_Temp;
两者的测试相当相似,在 1,000 行的 0.7 - 0.8 秒范围内。但我真的很担心可扩展性,并且我不想随着表的增长而重新设计所有内容。最好的方法是什么?
I have a table named Projects that has the following relationships:
has many Contributions
has many Payments
In my result set, I need the following aggregate values:
- Number of unique contributors (DonorID on the Contribution table)
- Total contributed (SUM of Amount on Contribution table)
- Total paid (SUM of PaymentAmount on Payment table)
Because there are so many aggregate functions and multiple joins, it gets messy do use standard aggregate functions the the GROUP BY clause. I also need the ability to sort and filter these fields. So I've come up with two options:
Using subqueries:
SELECT Project.ID AS PROJECT_ID,
(SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = PROJECT_ID) AS TotalPaidBack,
(SELECT COUNT(DISTINCT DonorID) FROM Contribution WHERE RecipientID = PROJECT_ID) AS ContributorCount,
(SELECT SUM(Amount) FROM Contribution WHERE RecipientID = PROJECT_ID) AS TotalReceived
FROM Project;
Using a temporary table:
DROP TABLE IF EXISTS Project_Temp;
CREATE TEMPORARY TABLE Project_Temp (project_id INT NOT NULL, total_payments INT, total_donors INT, total_received INT, PRIMARY KEY(project_id)) ENGINE=MEMORY;
INSERT INTO Project_Temp (project_id,total_payments)
SELECT `Project`.ID, IFNULL(SUM(PaymentAmount),0) FROM `Project` LEFT JOIN `Payment` ON ProjectID = `Project`.ID GROUP BY 1;
INSERT INTO Project_Temp (project_id,total_donors,total_received)
SELECT `Project`.ID, IFNULL(COUNT(DISTINCT DonorID),0), IFNULL(SUM(Amount),0) FROM `Project` LEFT JOIN `Contribution` ON RecipientID = `Project`.ID GROUP BY 1
ON DUPLICATE KEY UPDATE total_donors = VALUES(total_donors), total_received = VALUES(total_received);
SELECT * FROM Project_Temp;
Tests for both are pretty comparable, in the 0.7 - 0.8 seconds range with 1,000 rows. But I'm really concerned about scalability, and I don't want to have to re-engineer everything as my tables grow. What's the best approach?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
了解每 1K 行的时序固然很好,但真正的问题是如何使用它们。
您打算将所有这些发送回用户界面吗? Google 每页提供 25 个结果;也许你也应该这样做。
您打算在中间层进行计算吗?也许您可以在数据库上进行这些计算,从而节省自己通过网络传输所有这些字节的时间。
我的观点是,如果您仔细考虑如何处理 1,000 或 100 万行,您可能永远不需要处理它们。
您可以通过 EXPLAIN PLAN 来查看两个查询之间的差异。
Knowing the timing for each 1K rows is good, but the real question is how they'll be used.
Are you planning to send all these back to a UI? Google doles out results 25 per page; maybe you should, too.
Are you planning to do calculations in the middle tier? Maybe you can do those calculations on the database and save yourself bringing all those bytes across the wire.
My point is that you may never need to work with 1,000 or one million rows if you think carefully about what you do with them.
You can EXPLAIN PLAN to see what the difference between the two queries is.
我会采用第一种方法。您允许 RDBMS 完成它的工作,而不是试图替它完成它的工作。
通过创建临时表,您将始终为每个查询创建完整的表。如果您只需要一个项目的数据,您最终仍然会创建完整的表(除非您相应地限制每个 INSERT 语句)。当然,您可以对其进行编码,但它已经成为相当数量的代码和复杂性,从而获得了很小的性能提升。
通过 SELECT,数据库可以获取适当数量的数据,根据上下文优化整个查询。如果其他用户查询了相同的数据,它甚至可能被缓存(查询,可能还有数据,具体取决于您的数据库)。如果性能确实是一个问题,您可以考虑使用索引/物化视图,或在 INSERT/UPDATE/DELETE 触发器上生成表。横向扩展时,您可以使用服务器集群和分区视图 - 我认为如果您要创建临时表,这会很困难。
编辑:上面的内容是在没有考虑任何特定rdbms的情况下编写的,尽管OP补充说mysql是目标数据库。
I would go with the first approach. You are allowing the RDBMS to do it's job, rather than trying to do it's job for it.
By creating a temp table, you will always create the full table for each query. If you only want data for one project, you still end up creating the full table (unless you restrict each INSERT statement accordingly.) Sure, you can code it, but it's already becoming a fair amount code and complexity for a small performance gain.
With a SELECT, the db can fetch the appriate amount of data, optimizing the whole query based on context. If other users have queried the same data, it may even be cached (query, and possibly data, depending upon your db). If performance is truly a concern, you might consider using Indexed/Materialized Views, or generating a table on an INSERT/UPDATE/DELETE trigger. Scaling out, you can use server clusters and partioned views - something that I believe will be difficult if you are creating temporary tables.
EDIT: the above is written without any specific rdbms in mind, although the OP added that mysql is the target db.
还有第三个选项,它是派生表:
我不确定它是否会表现得更好,但你可能会尝试一下。
There is a third option which is derived tables:
I'm not sure if it will perform better, but you might give it shot.
一些想法:
派生表的想法在其他平台上会很好,但 MySQL 的派生表与视图存在同样的问题:它们没有索引。这意味着 MySQL 将在应用 WHERE 子句之前执行派生表的全部内容,这根本无法扩展。
选项 1 适合紧凑,但当您想要开始将派生表达式放入 WHERE 子句时,语法可能会变得棘手。
选项 1 适合紧凑,但当
物化视图的建议是一个很好的建议,但不幸的是 MySQL 不支持它们。我喜欢使用触发器的想法。您可以将该临时表转换为持久存在的真实表,然后在 Payments 和 Contribution 表上使用 INSERT/UPDATE/DELETE 触发器来更新 Project Stats 表。
最后,如果您不想弄乱触发器,并且不太关心新鲜度,您始终可以拥有单独的统计表并离线更新它,有一个每隔几分钟运行一次的 cron 作业执行您在上面的查询 #2 中指定的工作,但在实际表上除外。根据您的应用程序的细微差别,您的用户可能会也可能不会接受更新统计数据的这种轻微延迟。
A few thoughts:
The derived table idea would be good on other platforms, but MySQL has the same issue with derived tables that it does with views: they aren't indexed. That means that MySQL will execute the full content of the derived table before applying the WHERE clause, which doesn't scale at all.
Option 1 is good for being compact, but syntax might get tricky when you want to start putting the derived expressions in the WHERE clause.
The suggestion of materialized views is a good one, but MySQL unfortunately doesn't support them. I like the idea of using triggers. You could translate that temporary table into a real table that persists, and then use INSERT/UPDATE/DELETE triggers on the Payments and Contribution tables to update the Project Stats table.
Finally, if you don't want to mess with triggers, and if you aren't too concerned with freshness, you can always have the separate stats table and update it offline, having a cron job that runs every few minutes that does the work that you specified in Query #2 above, except on the real table. Depending on the nuances of your application, this slight delay in updating the stats may or may not be acceptable to your users.