Can OLAP measures be sets that are aggregated via set union, rather than numbers that are aggregated via addition?

Posted 2024-12-29 09:24:09

Typically measures in an OLAP cube are numbers, and these numbers get aggregated via addition (or via some not-very-exotic function like multiplication, MAX, or MIN). I'm wondering if any of the major OLAP servers let you define measures that are sets of numbers or sets of strings. ("Set" here is in the mathematical sense, not the OLAP jargon sense of "a list of tuples".) Whereas OLAP typically aggregates measures via numeric functions, my hypothetical "set measures" would be aggregated via set operations, e.g. set union or set intersection.

I'm interested both theoretically and practically. Theoretically/abstractly/mathematically, there's a nice parallel between addition-over-integers and union-over-sets, and it seems like someone could have considered this parallel in writing an OLAP server. (One potential implementation, if all the possible set members were known in advance, is to represent each set as a (potentially large) integer, and then to aggregate/union by performing bitwise OR.)
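
To make the bitwise-OR idea concrete, here is a minimal Python sketch. It assumes the full universe of authors is known up front; the names `AUTHORS`, `encode`, `decode`, and `union_all` are made up for illustration and do not come from any OLAP product.

```python
# Hypothetical, fixed universe of possible set members (an assumption of this sketch).
AUTHORS = ["Bill Jones", "Martha X", "John Q", "Sam S", "Jack X"]
BIT = {name: 1 << i for i, name in enumerate(AUTHORS)}


def encode(names):
    """Pack a collection of author names into one integer bitmask."""
    mask = 0
    for name in names:
        mask |= BIT[name]
    return mask


def decode(mask):
    """Unpack a bitmask back into the set of author names it represents."""
    return {name for name, bit in BIT.items() if mask & bit}


def union_all(masks):
    """Aggregate masks with bitwise OR, the way SUM aggregates numbers with +."""
    total = 0  # identity element: the empty set
    for mask in masks:
        total |= mask
    return total


a = encode(["John Q", "Sam S"])
b = encode(["John Q", "Jack X"])
rolled_up = union_all([a, b])
print(decode(rolled_up))
```

Like addition, bitwise OR is associative and commutative and has an identity element (0), so partial aggregates can be combined in any order. Unlike addition it is idempotent, so a member counted twice still appears once.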

As for practice, I'll attempt to provide a concrete case where this might seem at least marginally useful: Suppose you had a dataset where each fact was the metadata associated with an academic paper. Each paper might have a date, a topic, and a set of one or more authors, like so:

  • fact1: {"Year": 1997, "Topic" : "AI", "AuthorSet": ["Bill Jones", "Martha X"]}
  • fact2: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Sam S"]}
  • fact3: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Jack X"]}
  • etc.

(I'm using quasi-JSON here only because it helps make it obvious what's multi-valued.)

If you created an OLAP cube around this data, it would seem very natural to be able to make a report showing how the set of authors writing on a given topic changed from year to year. In MDX, it might look like this:

select
  [Measures].[AuthorSet] on columns,
  [Year].[Year].Members on rows
from [Papers]  -- hypothetical cube name
where ([Topic].[Topic].[AI])

For each year, this query would roll up the list of authors via set union.

To get the most out of this feature you'd probably need custom OLAP client tools that understand set measures. For existing clients, though, you could probably fall back to some string representation of a set (e.g. a query could return cells containing a string such as "Bill Jones; Martha X; John Q; Sam S; Jack X").
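
Outside of any OLAP server, the roll-up and the string fallback described above can be sketched in plain Python. This is only an illustration of the intended semantics, not how SSAS or any other engine works; `authors_by_year` and `as_cell` are hypothetical helper names.

```python
from collections import defaultdict

# The three example facts, with AuthorSet held as a Python set.
facts = [
    {"Year": 1997, "Topic": "AI", "AuthorSet": {"Bill Jones", "Martha X"}},
    {"Year": 1997, "Topic": "Linguistics", "AuthorSet": {"John Q", "Sam S"}},
    {"Year": 1997, "Topic": "Linguistics", "AuthorSet": {"John Q", "Jack X"}},
]


def authors_by_year(rows, topic):
    """Slice on one topic, then union the author sets within each year."""
    rolled = defaultdict(set)
    for row in rows:
        if row["Topic"] == topic:
            rolled[row["Year"]] |= row["AuthorSet"]
    return dict(rolled)


def as_cell(author_set):
    """Render a set as a string for clients that only understand scalar cells."""
    return "; ".join(sorted(author_set))


for year, authors in authors_by_year(facts, "Linguistics").items():
    print(year, as_cell(authors))
```

Sorting before joining keeps the cell text stable from one query to the next, which matters once the set is flattened to a string.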

I'm most familiar with SSAS, and SSAS doesn't seem to support anything like this out of the box. It seems like there might be a way to hack it together using measures of string type, maybe plus custom CLR functions, but I haven't figured it out yet.

You might also be able to solve this particular case in MDX alone, maybe using custom members, Generate and SetToStr? (Solutions welcome!) But my intuition is that as things get more complicated, it would be more natural to have the set aggregation happen behind the scenes, with this "set measure" functionality.

Comments (2)

幻想少年梦 2025-01-05 09:24:09

Here's what I had in mind, using factless fact tables...

FactlessAcademicPaper
   YearKey_FK
   CategoryKey_FK
   AuthorGroupKey_FK

DimYear
   YearKey_PK

DimCategory
   CategoryKey_PK

FactlessAuthorGroup
   AuthorGroupKey_PK
   AuthorKey_PK

DimAuthor
   AuthorKey_PK

This...

fact1: {"Year": 1997, "Topic" : "AI", "AuthorSet": ["Bill Jones", "Martha X"]}
fact2: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Sam S"]}
fact3: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Jack X"]}

...becomes this...

**FactlessAcademicPaper**
   Year:           1997
   Topic:          AI
   AuthorGroupKey: 1

   Year:           1997
   Topic:          Linguistics
   AuthorGroupKey: 2

   Year:           1997
   Topic:          Linguistics
   AuthorGroupKey: 3

**FactlessAuthorGroup**
   AuthorGroupKey: 1
   Author:         Bill Jones

   AuthorGroupKey: 1
   Author:         Martha X

   AuthorGroupKey: 2
   Author:         John Q

   AuthorGroupKey: 2
   Author:         Sam S

   AuthorGroupKey: 3
   Author:         John Q

   AuthorGroupKey: 3
   Author:         Jack X
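
As a rough sketch of how this layout recovers the author sets, the two tables can be joined on AuthorGroupKey and the matching groups unioned. The Python below stands in for the join a cube or SQL engine would perform; the row literals mirror the listing above, and `authors_for` is a hypothetical helper.

```python
from collections import defaultdict

# Rows of FactlessAcademicPaper, as listed above.
papers = [
    {"Year": 1997, "Topic": "AI", "AuthorGroupKey": 1},
    {"Year": 1997, "Topic": "Linguistics", "AuthorGroupKey": 2},
    {"Year": 1997, "Topic": "Linguistics", "AuthorGroupKey": 3},
]

# Rows of FactlessAuthorGroup (the bridge table): (AuthorGroupKey, Author).
author_groups = [
    (1, "Bill Jones"), (1, "Martha X"),
    (2, "John Q"), (2, "Sam S"),
    (3, "John Q"), (3, "Jack X"),
]

# Index the bridge table by group key.
group_members = defaultdict(set)
for group_key, author in author_groups:
    group_members[group_key].add(author)


def authors_for(year, topic):
    """Join papers to the bridge table and union the matching author groups."""
    result = set()
    for paper in papers:
        if paper["Year"] == year and paper["Topic"] == topic:
            result |= group_members[paper["AuthorGroupKey"]]
    return result


print(sorted(authors_for(1997, "Linguistics")))
```

John Q belongs to two groups but appears once in the result, which is the behaviour a distinct count over the author dimension would also give.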

Now, if you wanted to track the scores of the papers, you would just add a Score column to the FactlessAcademicPaper table...

FactlessAcademicPaper
   YearKey_FK
   CategoryKey_FK
   AuthorGroupKey_FK
   Score
   '1' AS PaperCount

And as you can see above, I've added a calculated field called PaperCount which should make it easier to compute the average score.
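
To illustrate why PaperCount helps: both Score and PaperCount are additive, so the average at any level is simply the ratio of two sums. In this sketch the score values are invented for illustration (the question contains no scores), and `average_score` is a hypothetical helper.

```python
# FactlessAcademicPaper rows with made-up Score values (assumption of this sketch).
papers = [
    {"Year": 1997, "Topic": "AI", "Score": 4.0, "PaperCount": 1},
    {"Year": 1997, "Topic": "Linguistics", "Score": 3.0, "PaperCount": 1},
    {"Year": 1997, "Topic": "Linguistics", "Score": 5.0, "PaperCount": 1},
]


def average_score(rows, topic):
    """Average = SUM(Score) / SUM(PaperCount) over the selected slice."""
    selected = [row for row in rows if row["Topic"] == topic]
    total_score = sum(row["Score"] for row in selected)
    total_count = sum(row["PaperCount"] for row in selected)
    return total_score / total_count if total_count else None


print(average_score(papers, "Linguistics"))
```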

半岛未凉 2025-01-05 09:24:09

My comment is biased as I'm working with the team building icCube OLAP Server.

Except in the spatial OLAP field, I don't know of any OLAP servers that deal flexibly with anything other than basic measure types. There are some technical difficulties: you have to get the new types into the system (it is very much one column, one measure), store them internally (over millions of rows), aggregate them (again over millions of rows), and finally send the result back to the client (XMLA). On our side the server is mostly ready for new types; we wanted to support VaR, and for that you need to support matrices and vectors. Here, as for other vendors, it is a question of a customer being interested enough.

Now, for your particular problem, as 'iPolvo' points out, this can be solved at the dimension or hierarchy level. You can define a calculated member that generates a string of the existing authors.

How you solve this depends on the size of your model. For performance, if the model is not big, I'd model this as a single dimension where you can use children(); there is no need for facts here. At the other extreme, for a big model, you can build three dimensions and use NonEmpty over all existing authors (this might not be fast). You can also combine both approaches.

For this you need to transform

fact1: {"Year": 1997, "Topic" : "AI", "AuthorSet": ["Bill Jones", "Martha X"]}

into

fact1.1: {"Year": 1997, "Topic" : "AI", "Author": "Martha X"}
fact1.2: {"Year": 1997, "Topic" : "AI", "Author": "Bill Jones"}
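
A minimal sketch of that transformation, assuming the facts arrive as Python dictionaries; `flatten` is a hypothetical helper, and the authors are sorted only to make the output order deterministic.

```python
def flatten(fact):
    """Turn one fact with an AuthorSet into one row per author."""
    base = {key: value for key, value in fact.items() if key != "AuthorSet"}
    return [{**base, "Author": author} for author in sorted(fact["AuthorSet"])]


fact1 = {"Year": 1997, "Topic": "AI", "AuthorSet": ["Bill Jones", "Martha X"]}
rows = flatten(fact1)
for row in rows:
    print(row)
```

With one row per author, Author becomes an ordinary dimension, and the set of authors for a given year and topic is just the non-empty members of that dimension.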