复杂分组-设计/性能问题
警告:这是一个大问题
我有一个设计问题,一开始很简单,但一步步发展就完全难住了我。
现实的简单版本有一个漂亮的平面事实表......
为了保护无辜者,所有名称均已更改
CREATE TABLE raw_data (
tier0_id INT, tier1_id INT, tier2_id INT, tier3_id INT,
metric0 INT, metric1 INT, metric2 INT, metric3 INT
)
tierID 与固定深度树中的实体相关。例如业务层次结构。
这些指标只是绩效数据,例如捕获的青蛙数量或释放的鸽子数量。
在报告中,好心的用户会做出如下选择:
- tier0_id 的 34 和 55 - 单独显示
- 所有 tier1_id - 将
- 所有 tier2_id 分组在一起 - 将
- 所有 tier3_id 分组在一起 - 分别显示
- 指标 2 和 3
这给了我以下内容查询类型:
SELECT
CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END AS tier0_id,
CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END AS tier1_id,
CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END AS tier2_id,
CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END AS tier3_id,
SUM(metric2) AS metric2, SUM(metric3) AS metric3
FROM
raw_data
INNER JOIN tier0_values ON tier0_values.id = raw_data.tier0_id OR tier0_values.id IS NULL
INNER JOIN tier1_values ON tier1_values.id = raw_data.tier1_id OR tier1_values.id IS NULL
INNER JOIN tier2_values ON tier2_values.id = raw_data.tier2_id OR tier2_values.id IS NULL
INNER JOIN tier3_values ON tier3_values.id = raw_data.tier3_id OR tier3_values.id IS NULL
GROUP BY
CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END,
CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END,
CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END,
CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END
它是动态 SQL 和参数化查询的完美混合体。是的,我知道,但是 SQL-CE 会让人们做一些奇怪的事情。此外,当合并以下更改时,可以对其进行整理...
从现在开始,我们需要能够在不同的层中包含 NULL。这意味着“适用于该层中的所有实体”。
例如,使用以下非常简化的数据:
Activity WorkingTime ActiveTime BusyTime
1 0m 10m 0m
2 0m 15m 0m
3 0m 20m 0m
NULL 60m 0m 45m
WorkingTime 永远不会应用于活动,因此所有值都使用 NULL ID。但 ActiveTime 专门针对特定活动,因此它带有合法的 ID。 BusyTime 也反对 NULL 活动,因为它是所有 ActiveTime 的累积。
如果要报告此数据,每一行中总是包含 NULL 值,因为 NULL 意味着“适用于所有内容”。数据看起来像...
Activity WorkingTime ActiveTime BusyTime (BusyOnOtherActivities)
1 60m 10m 45m (45-10 = 35m)
2 60m 15m 45m (45-15 = 30m)
3 60m 20m 45m (45-20 = 25m)
1&2 60m 25m 45m (45-25 = 20m)
1&3 60m 30m 45m (45-30 = 15m)
2&3 60m 35m 45m (45-35 = 10m)
ALL 60m 45m 45m (45-45 = 0m)
希望这个示例有意义,因为它实际上是一个多层层次结构(按照原始示例),并且在每一层中都允许 NULL。所以我将尝试一个 3 层的示例...
t0_id | t1_id | t2_id | m1 | m2 | m3 | m4 | m5
1 3 10 | 0 10 0 0 0
1 4 10 | 0 15 0 0 0
1 5 10 | 0 20 0 0 0
1 NULL 10 | 60 0 45 0 0
2 3 10 | 0 5 0 0 0
2 5 10 | 0 10 0 0 0
2 6 10 | 0 15 0 0 0
2 NULL 10 | 50 0 30 0 0
1 3 11 | 0 7 0 0 0
1 4 11 | 0 8 0 0 0
1 5 11 | 0 9 0 0 0
1 NULL 11 | 30 0 24 0 0
2 3 11 | 0 8 0 0 0
2 5 11 | 0 10 0 0 0
2 6 11 | 0 12 0 0 0
2 NULL 11 | 40 0 30 0 0
NULL NULL 10 | 0 0 0 60 0
NULL NULL 11 | 0 0 0 60 0
NULL NULL NULL | 0 0 0 0 2
这将在报告中给出许多、许多可能不同的输出记录,但这里有一些示例...
t0_id | t1_id | t2_id | m1 | m2 | m3 | m4 | m5
1 3 10 | 60 10 45 60 2
1 4 10 | 60 15 45 60 2
1 5 10 | 60 20 45 60 2
2 3 10 | 50 5 30 60 2
2 5 10 | 50 10 30 60 2
2 6 10 | 50 15 30 60 2
1 ALL 10 | 60 45 45 60 2
2 ALL 10 | 50 30 30 60 2
ALL 3 10 | 110 15 75 60 2
ALL 4 10 | 60 15 45 60 2
ALL 5 10 | 110 30 75 60 2
ALL 6 10 | 50 15 30 60 2
ALL 3 ALL | 180 30 129 120 2
ALL 4 ALL | 90 23 69 120 2
ALL 5 ALL | 180 49 129 120 2
ALL 6 ALL | 90 27 60 120 2
ALL ALL 10 | 110 129 129 60 2
ALL ALL 11 | 70 129 129 60 2
ALL ALL ALL | 180 129 129 120 2
1 3&4 ALL | 90 40 69 120 2
ALL 3&4 ALL | 180 53 129 120 2
尽管解释起来很混乱,但它使完整和我头脑中的逻辑感。我明白所问的是什么,但在我的一生中,我似乎无法为此编写一个不需要花费大量时间来执行的查询。
那么,您将如何编写这样的查询和/或重构架构?
我很高兴人们会询问我到目前为止所做的事情的示例,但我很想先听听其他人的真实想法和建议;)
WARNING : This is one BIG Question
I have a design problem that started simple, but in one step of growth has stumped me completely.
The simple version of reality has a nice flat fact table...
All names have been changed to protect the innocent
CREATE TABLE raw_data (
tier0_id INT, tier1_id INT, tier2_id INT, tier3_id INT,
metric0 INT, metric1 INT, metric2 INT, metric3 INT
)
The tierIDs relate to entities in a fixed depth tree. Such as a business hierarchy.
The metrics are just performance figures, such as number of frogs captured, or pigeons released.
In the reporting the kindly user would make selections to mean something like the following:
- tier0_id's 34 and 55 - shown separately
- all of tier1_id's - grouped together
- all of tier2_id's - grouped together
- all of tier3_id's - shown separately
- metrics 2 and 3
This gives me the following type of query:
SELECT
CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END AS tier0_id,
CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END AS tier1_id,
CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END AS tier2_id,
CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END AS tier3_id,
SUM(metric2) AS metric2, SUM(metric3) AS metric3
FROM
raw_data
INNER JOIN tier0_values ON tier0_values.id = raw_data.tier0_id OR tier0_values.id IS NULL
INNER JOIN tier1_values ON tier1_values.id = raw_data.tier1_id OR tier1_values.id IS NULL
INNER JOIN tier2_values ON tier2_values.id = raw_data.tier2_id OR tier2_values.id IS NULL
INNER JOIN tier3_values ON tier3_values.id = raw_data.tier3_id OR tier3_values.id IS NULL
GROUP BY
CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END,
CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END,
CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END,
CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END
It's a nice hybrid of Dynamic SQL, and parametrised queries. And yes, I know, but SQL-CE makes people do strange things. Besides, that can be tidied up as and when the following change gets incorporated...
From now on, we need to be able to include NULLs in the different tiers. This will mean "applies to ALL entities in that tier".
For example, with the following very simplified data:
Activity WorkingTime ActiveTime BusyTime
1 0m 10m 0m
2 0m 15m 0m
3 0m 20m 0m
NULL 60m 0m 45m
WorkingTime never applies to an activity, so al the values go in with a NULL ID. But ActiveTime is specifically about a specific activity, so it goes in with a legitimate ID. BusyTime is also against a NULL activity because it's the cumulation of all the ActiveTime.
If one were to report on this data, the NULL values -always- get included in every row, because the NULL -means- "applies to everything". The data would look like...
Activity WorkingTime ActiveTime BusyTime (BusyOnOtherActivities)
1 60m 10m 45m (45-10 = 35m)
2 60m 15m 45m (45-15 = 30m)
3 60m 20m 45m (45-20 = 25m)
1&2 60m 25m 45m (45-25 = 20m)
1&3 60m 30m 45m (45-30 = 15m)
2&3 60m 35m 45m (45-35 = 10m)
ALL 60m 45m 45m (45-45 = 0m)
Hopefully this example makes sense, because it's actually a multi-tiered hierarchy (as per the original example), and in every tier NULLs are allowed. So I'll try an example with 3 tiers...
t0_id | t1_id | t2_id | m1 | m2 | m3 | m4 | m5
1 3 10 | 0 10 0 0 0
1 4 10 | 0 15 0 0 0
1 5 10 | 0 20 0 0 0
1 NULL 10 | 60 0 45 0 0
2 3 10 | 0 5 0 0 0
2 5 10 | 0 10 0 0 0
2 6 10 | 0 15 0 0 0
2 NULL 10 | 50 0 30 0 0
1 3 11 | 0 7 0 0 0
1 4 11 | 0 8 0 0 0
1 5 11 | 0 9 0 0 0
1 NULL 11 | 30 0 24 0 0
2 3 11 | 0 8 0 0 0
2 5 11 | 0 10 0 0 0
2 6 11 | 0 12 0 0 0
2 NULL 11 | 40 0 30 0 0
NULL NULL 10 | 0 0 0 60 0
NULL NULL 11 | 0 0 0 60 0
NULL NULL NULL | 0 0 0 0 2
This would give many, many possible different output records in the reporting, but here are a few examples...
t0_id | t1_id | t2_id | m1 | m2 | m3 | m4 | m5
1 3 10 | 60 10 45 60 2
1 4 10 | 60 15 45 60 2
1 5 10 | 60 20 45 60 2
2 3 10 | 50 5 30 60 2
2 5 10 | 50 10 30 60 2
2 6 10 | 50 15 30 60 2
1 ALL 10 | 60 45 45 60 2
2 ALL 10 | 50 30 30 60 2
ALL 3 10 | 110 15 75 60 2
ALL 4 10 | 60 15 45 60 2
ALL 5 10 | 110 30 75 60 2
ALL 6 10 | 50 15 30 60 2
ALL 3 ALL | 180 30 129 120 2
ALL 4 ALL | 90 23 69 120 2
ALL 5 ALL | 180 49 129 120 2
ALL 6 ALL | 90 27 60 120 2
ALL ALL 10 | 110 129 129 60 2
ALL ALL 11 | 70 129 129 60 2
ALL ALL ALL | 180 129 129 120 2
1 3&4 ALL | 90 40 69 120 2
ALL 3&4 ALL | 180 53 129 120 2
As messy as this is to explain, it makes complete and logical sense in my head. I understand what is being asked, but for the life of me I can not seem to write a query for this that doesn't take excruciating amounts of time to execute.
So, how would you write such a query, and/or refactor the schema?
I appreciate that people will ask for examples of what I've done so far, but I'm eager to hear other people's uncorrupted ideas and advice first ;)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
这个问题看起来更像是一个标准化活动。我将从标准化表格开始
类似于:(根据您的使用情况,您可能需要更多身份字段)
我将创建一个类似于以下内容的分层表:(tierplan 允许多个分组。如果 tier_id 没有父级可以汇总,则 tierparent_id 是NULL 这允许在查询中进行递归。)
最后,我将创建一个将层和活动相关联的表,例如:
对此的查询应该“不太困难”。
The problem looks more like a normalization activity. I would start with normalizing the table
to something like: (You may need some more identity fields depending on your usage)
I'd create a tiering table that looks something like: (tierplan allows for multiple groupings. If a tier_id has no parent to roll up under, then tierparent_id is NULL This alllows for recursion in the query.)
Finally, I'd create a table that relates tiers and Activities something like:
Queries off of this ought to be "not too difficult."