复杂分组-设计/性能问题

发布于 2024-11-28 17:16:52 字数 5925 浏览 4 评论 0原文

警告:这是一个大问题


我有一个设计问题,一开始很简单,但一步步发展就完全难住了我。

现实的简单版本有一个漂亮的平面事实表......
为了保护无辜者,所有名称均已更改

CREATE TABLE raw_data (
  tier0_id INT, tier1_id  INT, tier2_id INT, tier3_id INT,
  metric0  INT, metric1   INT, metric2  INT, metric3  INT
)

tierID 与固定深度树中的实体相关。例如业务层次结构。

这些指标只是绩效数据,例如捕获的青蛙数量或释放的鸽子数量。

在报告中,好心的用户会做出如下选择:

  • tier0_id 的 34 和 55 - 单独显示
  • 所有 tier1_id - 将
  • 所有 tier2_id 分组在一起 - 将
  • 所有 tier3_id 分组在一起 - 分别显示
  • 指标 2 和 3

这给了我以下内容查询类型:

SELECT
  CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END AS tier0_id,
  CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END AS tier1_id,
  CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END AS tier2_id,
  CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END AS tier3_id,
  SUM(metric2) AS metric2, SUM(metric3) AS metric3
FROM
  raw_data
INNER JOIN tier0_values ON tier0_values.id = raw_data.tier0_id OR tier0_values.id IS NULL
INNER JOIN tier1_values ON tier1_values.id = raw_data.tier1_id OR tier1_values.id IS NULL
INNER JOIN tier2_values ON tier2_values.id = raw_data.tier2_id OR tier2_values.id IS NULL
INNER JOIN tier3_values ON tier3_values.id = raw_data.tier3_id OR tier3_values.id IS NULL
GROUP BY
  CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END,
  CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END,
  CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END,
  CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END

它是动态 SQL 和参数化查询的完美混合体。是的,我知道,但是 SQL-CE 会让人们做一些奇怪的事情。此外,当合并以下更改时,可以对其进行整理...


从现在开始,我们需要能够在不同的层中包含 NULL。这意味着“适用于该层中的所有实体”。

例如,使用以下非常简化的数据:

Activity    WorkingTime    ActiveTime    BusyTime

   1            0m             10m          0m
   2            0m             15m          0m
   3            0m             20m          0m
  NULL         60m              0m         45m

WorkingTime 永远不会应用于活动,因此所有值都使用 NULL ID。但 ActiveTime 专门针对特定活动,因此它带有合法的 ID。 BusyTime 也反对 NULL 活动,因为它是所有 ActiveTime 的累积。

如果要报告此数据,每一行中总是包含 NULL 值,因为 NULL 意味着“适用于所有内容”。数据看起来像...

Activity    WorkingTime    ActiveTime    BusyTime   (BusyOnOtherActivities)

   1           60m             10m         45m            (45-10 = 35m)
   2           60m             15m         45m            (45-15 = 30m)
   3           60m             20m         45m            (45-20 = 25m)

  1&2          60m             25m         45m            (45-25 = 20m)
  1&3          60m             30m         45m            (45-30 = 15m)
  2&3          60m             35m         45m            (45-35 = 10m)

  ALL          60m             45m         45m            (45-45 =  0m)



希望这个示例有意义,因为它实际上是一个多层层次结构(按照原始示例),并且在每一层中都允许 NULL。所以我将尝试一个 3 层的示例...

t0_id  |  t1_id  |  t2_id   |   m1  |  m2  |  m3  |  m4  |  m5
    1         3        10   |    0     10      0      0      0
    1         4        10   |    0     15      0      0      0
    1         5        10   |    0     20      0      0      0
    1      NULL        10   |   60      0     45      0      0
    2         3        10   |    0      5      0      0      0
    2         5        10   |    0     10      0      0      0
    2         6        10   |    0     15      0      0      0
    2      NULL        10   |   50      0     30      0      0
    1         3        11   |    0      7      0      0      0
    1         4        11   |    0      8      0      0      0
    1         5        11   |    0      9      0      0      0
    1      NULL        11   |   30      0     24      0      0
    2         3        11   |    0      8      0      0      0
    2         5        11   |    0     10      0      0      0
    2         6        11   |    0     12      0      0      0
    2      NULL        11   |   40      0     30      0      0
 NULL      NULL        10   |    0      0      0     60      0
 NULL      NULL        11   |    0      0      0     60      0
 NULL      NULL      NULL   |    0      0      0      0      2

这将在报告中给出许多、许多可能不同的输出记录,但这里有一些示例...

t0_id  |  t1_id  |  t2_id   |   m1  |  m2  |  m3  |  m4  |  m5

    1         3        10   |   60     10     45     60      2
    1         4        10   |   60     15     45     60      2
    1         5        10   |   60     20     45     60      2

    2         3        10   |   50      5     30     60      2
    2         5        10   |   50     10     30     60      2
    2         6        10   |   50     15     30     60      2

    1       ALL        10   |   60     45     45     60      2
    2       ALL        10   |   50     30     30     60      2

  ALL         3        10   |  110     15     75     60      2
  ALL         4        10   |   60     15     45     60      2
  ALL         5        10   |  110     30     75     60      2
  ALL         6        10   |   50     15     30     60      2

  ALL         3       ALL   |  180     30    129    120      2
  ALL         4       ALL   |   90     23     69    120      2
  ALL         5       ALL   |  180     49    129    120      2
  ALL         6       ALL   |   90     27     60    120      2

  ALL       ALL        10   |  110    129    129     60      2
  ALL       ALL        11   |   70    129    129     60      2
  ALL       ALL       ALL   |  180    129    129    120      2

    1       3&4       ALL   |   90     40     69    120      2
  ALL       3&4       ALL   |  180     53    129    120      2


尽管解释起来很混乱,但它使完整和我头脑中的逻辑感。我明白所问的是什么,但在我的一生中,我似乎无法为此编写一个不需要花费大量时间来执行的查询。

那么,您将如何编写这样的查询和/或重构架构?

我很高兴人们会询问我到目前为止所做的事情的示例,但我很想先听听其他人的真实想法和建议;)

WARNING : This is one BIG Question

I have a design problem that started simple, but in one step of growth has stumped me completely.

The simple version of reality has a nice flat fact table...
All names have been changed to protect the innocent

CREATE TABLE raw_data (
  tier0_id INT, tier1_id  INT, tier2_id INT, tier3_id INT,
  metric0  INT, metric1   INT, metric2  INT, metric3  INT
)

The tierIDs relate to entities in a fixed depth tree. Such as a business hierarchy.

The metrics are just performance figures, such as number of frogs captured, or pigeons released.

In the reporting the kindly user would make selections to mean something like the following:

  • tier0_id's 34 and 55 - shown separately
  • all of tier1_id's - grouped together
  • all of tier2_id's - grouped together
  • all of tier3_id's - shown separately
  • metrics 2 and 3

This gives me the following type of query:

SELECT
  CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END AS tier0_id,
  CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END AS tier1_id,
  CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END AS tier2_id,
  CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END AS tier3_id,
  SUM(metric2) AS metric2, SUM(metric3) AS metric3
FROM
  raw_data
INNER JOIN tier0_values ON tier0_values.id = raw_data.tier0_id OR tier0_values.id IS NULL
INNER JOIN tier1_values ON tier1_values.id = raw_data.tier1_id OR tier1_values.id IS NULL
INNER JOIN tier2_values ON tier2_values.id = raw_data.tier2_id OR tier2_values.id IS NULL
INNER JOIN tier3_values ON tier3_values.id = raw_data.tier3_id OR tier3_values.id IS NULL
GROUP BY
  CASE WHEN @t0_grouping = 1 THEN NULL ELSE tier0_id END,
  CASE WHEN @t1_grouping = 1 THEN NULL ELSE tier1_id END,
  CASE WHEN @t2_grouping = 1 THEN NULL ELSE tier2_id END,
  CASE WHEN @t3_grouping = 1 THEN NULL ELSE tier3_id END

It's a nice hybrid of Dynamic SQL, and parametrised queries. And yes, I know, but SQL-CE makes people do strange things. Besides, that can be tidied up as and when the following change gets incorporated...

From now on, we need to be able to include NULLs in the different tiers. This will mean "applies to ALL entities in that tier".

For example, with the following very simplified data:

Activity    WorkingTime    ActiveTime    BusyTime

   1            0m             10m          0m
   2            0m             15m          0m
   3            0m             20m          0m
  NULL         60m              0m         45m

WorkingTime never applies to an activity, so al the values go in with a NULL ID. But ActiveTime is specifically about a specific activity, so it goes in with a legitimate ID. BusyTime is also against a NULL activity because it's the cumulation of all the ActiveTime.

If one were to report on this data, the NULL values -always- get included in every row, because the NULL -means- "applies to everything". The data would look like...

Activity    WorkingTime    ActiveTime    BusyTime   (BusyOnOtherActivities)

   1           60m             10m         45m            (45-10 = 35m)
   2           60m             15m         45m            (45-15 = 30m)
   3           60m             20m         45m            (45-20 = 25m)

  1&2          60m             25m         45m            (45-25 = 20m)
  1&3          60m             30m         45m            (45-30 = 15m)
  2&3          60m             35m         45m            (45-35 = 10m)

  ALL          60m             45m         45m            (45-45 =  0m)

Hopefully this example makes sense, because it's actually a multi-tiered hierarchy (as per the original example), and in every tier NULLs are allowed. So I'll try an example with 3 tiers...

t0_id  |  t1_id  |  t2_id   |   m1  |  m2  |  m3  |  m4  |  m5
    1         3        10   |    0     10      0      0      0
    1         4        10   |    0     15      0      0      0
    1         5        10   |    0     20      0      0      0
    1      NULL        10   |   60      0     45      0      0
    2         3        10   |    0      5      0      0      0
    2         5        10   |    0     10      0      0      0
    2         6        10   |    0     15      0      0      0
    2      NULL        10   |   50      0     30      0      0
    1         3        11   |    0      7      0      0      0
    1         4        11   |    0      8      0      0      0
    1         5        11   |    0      9      0      0      0
    1      NULL        11   |   30      0     24      0      0
    2         3        11   |    0      8      0      0      0
    2         5        11   |    0     10      0      0      0
    2         6        11   |    0     12      0      0      0
    2      NULL        11   |   40      0     30      0      0
 NULL      NULL        10   |    0      0      0     60      0
 NULL      NULL        11   |    0      0      0     60      0
 NULL      NULL      NULL   |    0      0      0      0      2

This would give many, many possible different output records in the reporting, but here are a few examples...

t0_id  |  t1_id  |  t2_id   |   m1  |  m2  |  m3  |  m4  |  m5

    1         3        10   |   60     10     45     60      2
    1         4        10   |   60     15     45     60      2
    1         5        10   |   60     20     45     60      2

    2         3        10   |   50      5     30     60      2
    2         5        10   |   50     10     30     60      2
    2         6        10   |   50     15     30     60      2

    1       ALL        10   |   60     45     45     60      2
    2       ALL        10   |   50     30     30     60      2

  ALL         3        10   |  110     15     75     60      2
  ALL         4        10   |   60     15     45     60      2
  ALL         5        10   |  110     30     75     60      2
  ALL         6        10   |   50     15     30     60      2

  ALL         3       ALL   |  180     30    129    120      2
  ALL         4       ALL   |   90     23     69    120      2
  ALL         5       ALL   |  180     49    129    120      2
  ALL         6       ALL   |   90     27     60    120      2

  ALL       ALL        10   |  110    129    129     60      2
  ALL       ALL        11   |   70    129    129     60      2
  ALL       ALL       ALL   |  180    129    129    120      2

    1       3&4       ALL   |   90     40     69    120      2
  ALL       3&4       ALL   |  180     53    129    120      2

As messy as this is to explain, it makes complete and logical sense in my head. I understand what is being asked, but for the life of me I can not seem to write a query for this that doesn't take excruciating amounts of time to execute.

So, how would you write such a query, and/or refactor the schema?

I appreciate that people will ask for examples of what I've done so far, but I'm eager to hear other people's uncorrupted ideas and advice first ;)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

往昔成烟 2024-12-05 17:16:52

这个问题看起来更像是一个标准化活动。我将从标准化表格开始
类似于:(根据您的使用情况,您可能需要更多身份字段)

CREATE TABLE raw_data (
    rawData_ID INT,
    Activity_id INT, 
    metric0  INT)

我将创建一个类似于以下内容的分层表:(tierplan 允许多个分组。如果 tier_id 没有父级可以汇总,则 tierparent_id 是NULL 这允许在查询中进行递归。)

CREATE TABLE tiers (
    tierplan_id INT,
    tier_id INT,
    tierparent_id INT)

最后,我将创建一个将层和活动相关联的表,例如:

CREATE TABLE ActivTiers (
    Activplan_id INT, --id on the table
    tierplan_id INT,  --tells what tierplan the raw_data falls under
    rawdata_id INT)   --this allows the ActivityId to be payload instead of identifier.

对此的查询应该“不太困难”。

The problem looks more like a normalization activity. I would start with normalizing the table
to something like: (You may need some more identity fields depending on your usage)

CREATE TABLE raw_data (
    rawData_ID INT,
    Activity_id INT, 
    metric0  INT)

I'd create a tiering table that looks something like: (tierplan allows for multiple groupings. If a tier_id has no parent to roll up under, then tierparent_id is NULL This alllows for recursion in the query.)

CREATE TABLE tiers (
    tierplan_id INT,
    tier_id INT,
    tierparent_id INT)

Finally, I'd create a table that relates tiers and Activities something like:

CREATE TABLE ActivTiers (
    Activplan_id INT, --id on the table
    tierplan_id INT,  --tells what tierplan the raw_data falls under
    rawdata_id INT)   --this allows the ActivityId to be payload instead of identifier.

Queries off of this ought to be "not too difficult."

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文