Database behavior: HAVING-SUM vs WHERE / DISTINCT vs GROUP BY

Posted on 2024-10-08 09:31:28

Suppose I have a very large summary table where we keep the sum of activity points: one row per user per day, with a separate column for each activity type holding the points the user earned for that type on that day:

CREATE TABLE summery_data
(
    UserID INT NOT NULL,
    ActivityDate DATE,
    t1 INT NOT NULL,
    t2 INT NOT NULL,
    t3 INT NOT NULL,
    t4 INT NOT NULL,
    PRIMARY KEY(UserID, ActivityDate)
)

Every morning we populate the previous day's data. We insert a row for every user:

INSERT summery_data
SELECT UserID, '2010-12-16'
    , SUM(IF(TypeID = 1, Points, 0))
    , SUM(IF(TypeID = 2, Points, 0))
    , SUM(IF(TypeID = 3, Points, 0))
    , SUM(IF(TypeID = 4, Points, 0))
FROM activities
WHERE ActivityDate >= '2010-12-16' AND ActivityDate < '2010-12-17'
GROUP BY UserID

The table data looks something like this:

UserID  ActivityDate   t1   t2  t3  t4
1       2010-01-01      0   82  0   0
1       2010-01-02      100 1   12  0
2       2010-01-01      0   0   0   41
2       2010-01-02      0   0   0   1
3       2010-01-02      0   0   0   106
3       2010-01-03      2   5   0   4

The table is very large (10M+ rows). I want to get a list of the user IDs that had any activity points for t1, t2 or t3 (t4 should not count) on any day. For the sample data above, my end result would include UserID 1 and 3.

Which of the following queries is better?

SELECT DISTINCT UserID
FROM summery_data
WHERE t1 > 0 OR t2 > 0 OR t3 > 0

SELECT UserID
FROM summery_data
GROUP BY UserID
HAVING SUM(t1) > 0 OR SUM(t2) > 0 OR SUM(t3) > 0

To understand which will be faster, I have some questions about what goes on behind the scenes:

1. In a DISTINCT query, how does the database ensure that each UserID is added to the result set only once? Does it check each UserID to see whether it already exists in the set? Or, since the table is clustered by UserID anyway, does it just keep a variable, while scanning the rows, holding the last UserID added to the result set?
2. In a DISTINCT query, once the database finds a single row that matches the criteria for the current UserID, does it stop checking the predicate in the WHERE clause until it hits the next UserID?
3. In a GROUP BY query, while summing the t1 column, once the database finds a record where t1 > 0, which would satisfy the HAVING clause, does it stop summing the remaining t1 rows for the current UserID (since the predicate > 0 is already true)? Or does it at least skip summing the other columns (t2 and t3), since there is no need for that? Or does the database first sum t1, t2 and t3 in full before evaluating the HAVING clause?

Note: I am using MySQL as the database server, but I would like to know whether SQL Server or any other database system would work differently.

Any help is greatly appreciated.

我们只是彼此的过ke 2024-10-15 09:31:28

Your queries are not equivalent if you allow negative numbers in any of (t1, t2, t3, t4).
Consider the following data:

user_id   T1   T2   T3   T4
-------  ---  ---  ---  ---
   1      -2   0    0    0
   1       2   0    0    0
   2       1   0    0    0
   2       2   0    0    0

Your first query (DISTINCT) will include both user 1 and user 2, as there is at least one row for each user with a T1 value > 0.

The second query (GROUP BY with HAVING) will exclude user 1, as the sum of the T1 values is 0 (even though there are values within the group that are > 0). This is also a good example of the difference between HAVING and WHERE: WHERE operates on individual rows; HAVING operates on the group as a whole.
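The difference can be reproduced with a small script. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, the ActivityDate values are made up (the example data above has no dates), and ORDER BY is added purely to make the output deterministic.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE summery_data "
    "(UserID INT, ActivityDate TEXT, t1 INT, t2 INT, t3 INT, t4 INT)"
)

# Rows from the example above; the dates are placeholders (an assumption).
rows = [
    (1, "2010-01-01", -2, 0, 0, 0),
    (1, "2010-01-02", 2, 0, 0, 0),
    (2, "2010-01-01", 1, 0, 0, 0),
    (2, "2010-01-02", 2, 0, 0, 0),
]
conn.executemany("INSERT INTO summery_data VALUES (?, ?, ?, ?, ?, ?)", rows)

# WHERE filters individual rows before duplicates are removed.
distinct_users = [r[0] for r in conn.execute(
    "SELECT DISTINCT UserID FROM summery_data "
    "WHERE t1 > 0 OR t2 > 0 OR t3 > 0 ORDER BY UserID"
)]

# HAVING filters whole groups after the sums are computed.
having_users = [r[0] for r in conn.execute(
    "SELECT UserID FROM summery_data GROUP BY UserID "
    "HAVING SUM(t1) > 0 OR SUM(t2) > 0 OR SUM(t3) > 0 ORDER BY UserID"
)]

print(distinct_users)  # [1, 2]
print(having_users)    # [2]
```

User 1 appears only in the first result because the row with T1 = 2 passes the WHERE filter on its own, while the group total for T1 is 0.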

The rest of the answer is not only highly vendor-dependent, but also completely irrelevant from a SQL perspective, since it is the database that ultimately makes the choices. Having said that, by knowing a little about it, you can influence the optimizer by writing your queries in a certain way.

Question 1

I know of three strategies a database can use to produce a list of distinct values. Which one is used is determined by the estimated cost of each operation.

Sorting. Sort the resultset. Run through the sorted result, and keep track of the previous value. This is potentially very expensive (slow) if it cannot fit into memory.

Hashing. A hash function is applied to all rows in the resultset. The result is stored in an intermediate hashtable. This is often faster than sorting.

Index walk. This is basically the same technique as sorting, but as the index is already sorted, that step is skipped.
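The three strategies can be sketched in a few lines of Python. This is a conceptual illustration only, not how any particular database implements them; the function names are made up for this example.

```python
from itertools import groupby


def distinct_by_sort(values):
    """Sort the input, then keep a value whenever it differs from the previous one."""
    result = []
    previous = object()  # sentinel that compares unequal to every real value
    for value in sorted(values):
        if value != previous:
            result.append(value)
            previous = value
    return result


def distinct_by_hash(values):
    """Keep a hash set of values seen so far; no sorting needed."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


def distinct_by_index_walk(values_in_index_order):
    """Input already arrives sorted (as from an index), so the sort is skipped."""
    return [key for key, _ in groupby(values_in_index_order)]


user_ids = [3, 1, 2, 1, 3, 3]
print(distinct_by_sort(user_ids))                # [1, 2, 3]
print(distinct_by_hash(user_ids))                # [3, 1, 2]
print(distinct_by_index_walk(sorted(user_ids)))  # [1, 2, 3]
```

The hash variant returns values in first-seen order, which is one reason a DISTINCT result has no guaranteed order unless the query adds ORDER BY.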

Question 2

The database is free to evaluate your predicates in any order it wants. You cannot easily decide this yourself. The optimizer can use heuristics or statistics to find the optimal evaluation order. It has to obey the same boolean principles as the rest of us: when any one of (t1=1 or t2=2 or t3=3) is true, it can stop evaluating the others.

Question 3

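No. This is explained by my example above regarding WHERE/HAVING: the sum for a group is not known until every row in the group has been added, because a later negative value can still change the result.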
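A small Python sketch shows why stopping early would be wrong when negative values are allowed. Both functions are hypothetical illustrations, not database internals; the values are the T1 values of user 1 from the example, read in the order 2, -2 (row order within a group is not guaranteed).

```python
def having_sum_positive(values):
    """What HAVING SUM(col) > 0 means for one group: add up every row first."""
    return sum(values) > 0


def stop_at_first_positive(values):
    """Hypothetical shortcut: stop as soon as the running sum is positive."""
    running = 0
    for value in values:
        running += value
        if running > 0:
            return True  # exits before seeing the remaining rows
    return False


t1_for_user_1 = [2, -2]
print(having_sum_positive(t1_for_user_1))     # False
print(stop_at_first_positive(t1_for_user_1))  # True
```

The shortcut reports a match after the first row, but the full sum is 0, so the group does not satisfy the HAVING clause.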

灯角 2024-10-15 09:31:28

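Many of your specific questions depend on the implementation.

SQL queries are declarative. They do not specify how to obtain the answer, only what you are looking for. The DBMS (database management system) decides how to carry that out. Most SELECT queries involve some kind of iteration over the table, a table scan (unless an index on the relevant field avoids it), but you do not see an explicit loop in the query.

What I can definitely advise is this: do not use aggregate functions such as SUM if you are not interested in the actual value of the sum. If you want the UserIDs that have a positive value in any of those three fields in any row, use DISTINCT. That at least gives the DBMS the chance to do the right thing and optimize the query.

An index might help this query, but not much. Where indexes really help is in equi-joins between different tables (equi-joining a table of m rows with a table of n rows could otherwise take m*n time). Here, all you are doing is filtering on whether one of those 3 fields is positive. In the worst case you look at each row once. An index on UserID, combined with DISTINCT, could help skip checking rows for users you have already decided to include.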
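The last point, skipping rows for a user who is already in the result, can be sketched as follows. This is a conceptual illustration in Python, assuming the rows arrive ordered by UserID (as an index on UserID would deliver them); whether a given DBMS actually applies this shortcut depends on the implementation. The function name and the counter are made up for the example, and the rows are the sample data from the question without the date and t4 columns.

```python
def users_with_activity(rows):
    """Return (matching user IDs, number of rows whose predicate was evaluated).

    rows: (user_id, t1, t2, t3) tuples ordered by user_id.
    """
    result = []
    checked = 0
    for user_id, t1, t2, t3 in rows:
        if result and result[-1] == user_id:
            continue  # user already included, no need to test this row
        checked += 1
        if t1 > 0 or t2 > 0 or t3 > 0:
            result.append(user_id)
    return result, checked


sample_rows = [
    (1, 0, 82, 0),
    (1, 100, 1, 12),
    (2, 0, 0, 0),
    (2, 0, 0, 0),
    (3, 0, 0, 0),
    (3, 2, 5, 0),
]
print(users_with_activity(sample_rows))  # ([1, 3], 5)
```

Only the second row of user 1 is skipped here, so the saving is small on this sample; it grows when users have many rows and an early one already matches.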
