Mondrian / OLAP 是否是连接大维度/集合的错误工具？

发布于 2024-12-16 14:27:11 字数 1318 浏览 2 评论 0原文

简介：我见过的大多数 MDX 连接示例都涉及连接相对较小的集合，例如每个集合有数十或数百个项目。但我发现自己也想尝试加入（特别是“非空加入”）集合，每个集合都有数千或数万个项目，但到目前为止效果不佳。我想知道这是否可以工作，或者我是否需要考虑使用 Mondrian/OLAP 以外的东西。

具体来说，我有一个记录公司（n=7000）和客户（n=27000）之间交互的立方体。目前公司和客户都是完全扁平的层次结构；有所有级别和个人公司级别，中间没有其他级别。有一个中央事实表，以及公司和客户的单独维度表。

我的用户至少似乎希望获得这些方面的摘要报告，聚合公司和客户之间的所有非空交互：

select
  [Measures].[Amount] on columns,
  NonEmptyCrossJoin([Firm].Children,
                      [Client].Children) on rows
from MyCube

但是这个查询及其变体在我的测试蒙德里安设置中不起作用。要么我得到 OutOfMemoryException（在 2GB Java 堆上），要么 Java 似乎在 mondrian.rolap.RolapResult$AxisMember.mergeTuple(TupleCursor) 中花费了难以置信的长时间。（如果有帮助的话，我可以提供更完整的堆栈跟踪。）我所说的“不可能长”是指在我放弃之前，Java 将在查询上花费数小时的时间。

我最初期望上面的查询能够正常执行，因为从概念上讲，只需按照以下方式执行 SQL 查询就可以有效地完成某种程度上：（

select Firm, Client, Sum(Amount) as n
from fact, firm, client
where fact.firmid = firm.firmid and fact.clientid = client.clientid
group by Firm, Client

事实上，如果我直接在 MySql 中执行类似的操作，它不会执行时间不会超过 15 秒。）

但从调试日志来看，Mondrian 似乎并没有尝试这种优化。相反，它似乎是在内部进行连接，并且最终速度特别慢。我在 mondrian.properties 中设置了 mondrian.native.crossjoin.enable=true ，但这似乎不是 Mondrian 能够“原生”的连接类型之一。（如果我打开 mondrian.native.unsupported.alert=ERROR 那么我会得到相应的异常。）

我想知道是否需要阻止我的用户尝试在如此大的维度/集合上进行连接，或者 Mondrian 是否可能不需要我在这里寻找的工具。但也许我只是做错了什么。

原文

Summary: Most of the examples I've seen of MDX joins have involved joining relatively small sets, say with tens or hundreds of items each. But I find myself also wanting to try joining (in particular "non-empty joining") sets that have thousands or tens of thousands of items each, and it's not working well so far. I'm wondering if this could be made to work, or if I perhaps need to consider using something other than Mondrian/OLAP.

To be concrete, I have a cube that records interactions between Firms (n=7000) and Clients (n=27000). Currently both Firm and Client are completely flat hierarchies; there's the All level and the individual-company level, with no other levels in between. There is a central fact table, and separate dimension tables for Firms and for Clients.

My users at least appear to want to get summary reports along these lines, aggregating all the non-empty interactions between Firms and Clients:

select
  [Measures].[Amount] on columns,
  NonEmptyCrossJoin([Firm].Children,
                      [Client].Children) on rows
from MyCube

But this query and variations on it don't work in my test Mondrian setup. Either I get an OutOfMemoryException (on a 2GB Java heap), or Java seems to spend impossibly long time in mondrian.rolap.RolapResult$AxisMember.mergeTuple(TupleCursor). (I can provide a more complete stack trace if it would help.) By "impossibly long" I mean Java will stay slaving away at the query for hours and hours before I give up.

I initially expected the above query to perform ok, because conceptually it could be done somewhat efficiently by just doing a SQL query along these lines:

select Firm, Client, Sum(Amount) as n
from fact, firm, client
where fact.firmid = firm.firmid and fact.clientid = client.clientid
group by Firm, Client

(In fact, if I execute something like this directly in MySql it doesn't take more than 15sec to execute.)

But from the debug logs Mondrian doesn't seem to attempt this optimization. Instead it appears to be doing the join internally, and in a way that ends up being particularly slow. I've set mondrian.native.crossjoin.enable=true in my mondrian.properties, but this doesn't seem like one of the join types that Mondrian is able to "make native". (If I turn on mondrian.native.unsupported.alert=ERROR then I get the corresponding exception.)

I'm left wondering whether I need to prevent my users from attempting joins on such large dimensions/sets, or whether Mondrian is maybe not the tool I'm looking for here. But maybe I'm just doing something wrong.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

娇女薄笑 2024-12-23 14:27:11

为了跟进，我尝试在 Sql Server Analysis Services (Sql Server 2008) 中设置一个类似的多维数据集，似乎 icCube 对于不同 OLAP 工具的性能不同有一个观点：

即使在我了解 SSAS 最佳实践之前，这方面的性能MDX 类型得到很大改进。按照这些思路进行的查询

select
  [Measures].[Amount] on columns,
  NON EMPTY
  crossjoin([Firms].[Firm Name].Children,
            [Clients].[Client Name].Children)
  on rows
from MyCube

从在 Mondrian 中不可行变为在 Sql Server 中需要大约十秒。可以想象，这与 MS 的商业智能开发工作室默认引导我创建 MOLAP 多维数据集有关，或者 SSAS 可能有更智能的查询规划器。

无论如何，也许这对我来说已经足够快了。如果没有，我还不确定 SSAS 在这种情况下可以得到多少优化。（一件令人失望的事情是，即使当我第二次重新运行查询时，仍然需要大约 10 秒；我希望缓存可能会产生更显着的效果。）

顺便说一句，您可能会注意到在刚刚引用的 MDX I 中已将我原来的 NonEmptyCrossJoin 替换为与 NON EMPTY 结合的普通交叉连接。这是因为，至少在 Sql Server 世界中，NonEmptyCrossJoin 显然被视为已弃用的不良实践。（这是在 Microsoft 的 MDX 语言参考中提到的。Mosha，其中一位前 SSAS 开发人员在一篇名为 MDX：非空、存在和邪恶NonEmptyCrossJoin 的简短版本是NonEmptyCrossJoin 语义混乱且应用有限，而且由于Sql Server大约在 2005 年，查询优化器已经足够智能，可以在不使用 NonEmptyCrossJoin 的情况下加快查询速度。）因此，我在上面的 MDX 中替换了更现代、经过批准的等效项。（它仍然可以与 NonEmptyCrossJoin 一起使用，尽管 NonEmptyCrossJoin 根本不会加快速度。）

To follow up, I tried setting up an analogous cube in Sql Server Analysis Services (Sql Server 2008), and it seems that icCube has a point about different OLAP tools performing differently:

Even before I learned much about SSAS best practices, performance on this type of MDX was much improved. A query along these lines

select
  [Measures].[Amount] on columns,
  NON EMPTY
  crossjoin([Firms].[Firm Name].Children,
            [Clients].[Client Name].Children)
  on rows
from MyCube

went from being non-viable with Mondrian to taking around ten seconds under Sql Server. Conceivably this has to do with MS' Business Intelligence Development Studio guiding me into creating a MOLAP cube by default, or perhaps SSAS has a smarter query planner.

In any case, perhaps this is fast enough for me. If not, I'm not yet sure how much more optimized SSAS can get in this case. (One disappointing thing is that, even when I re-run the query a second time, it still takes about 10 sec; I was hoping caching might have a more dramatic effect.)

Tangentially, you may notice in the just-quoted MDX I've replaced my original NonEmptyCrossJoin with a normal crossjoin combined with NON EMPTY. This is because, at least in the Sql Server world, NonEmptyCrossJoin is apparently regarded as a deprecated bad practice. (This is noted in Microsoft's MDX Language Reference. Mosha, one of the former SSAS devs, describes the situation in an article called MDX: NonEmpty, Exists and evil NonEmptyCrossJoin. Short version is that NonEmptyCrossJoin has confusing semantics and limited application, and since Sql Server 2005 or so, the query optimizer has been smart enough to make your query fast without NonEmptyCrossJoin.) So I've substituted a more modern approved equivalent in the above MDX. (It does still work with NonEmptyCrossJoin as well, though NonEmptyCrossJoin does not speed things up at all.)

回复收藏 0 原文