Mondrian / OLAP 是否是连接大维度/集合的错误工具?

发布于 2024-12-16 14:27:11 字数 1318 浏览 2 评论 0原文

简介: 我见过的大多数 MDX 连接示例都涉及连接相对较小的集合,例如每个集合有数十或数百个项目。但我发现自己也想尝试加入(特别是“非空加入”)集合,每个集合都有数千或数万个项目,但到目前为止效果不佳。我想知道这是否可以工作,或者我是否需要考虑使用 Mondrian/OLAP 以外的东西。

具体来说,我有一个记录公司(n=7000)和客户(n=27000)之间交互的立方体。目前公司和客户都是完全扁平的层次结构;有所有级别和个人公司级别,中间没有其他级别。有一个中央事实表,以及公司和客户的单独维度表。

我的用户至少似乎希望获得这些方面的摘要报告,聚合公司和客户之间的所有非空交互:

select
  [Measures].[Amount] on columns,
  NonEmptyCrossJoin([Firm].Children,
                      [Client].Children) on rows
from MyCube

但是这个查询及其变体在我的测试蒙德里安设置中不起作用。要么我得到 OutOfMemoryException(在 2GB Java 堆上),要么 Java 似乎在 mondrian.rolap.RolapResult$AxisMember.mergeTuple(TupleCursor) 中花费了难以置信的长时间。 (如果有帮助的话,我可以提供更完整的堆栈跟踪。)我所说的“不可能长”是指在我放弃之前,Java 将在查询上花费数小时的时间。

我最初期望上面的查询能够正常执行,因为从概念上讲,只需按照以下方式执行 SQL 查询就可以有效地完成某种程度上:(

select Firm, Client, Sum(Amount) as n
from fact, firm, client
where fact.firmid = firm.firmid and fact.clientid = client.clientid
group by Firm, Client

事实上,如果我直接在 MySql 中执行类似的操作,它不会执行时间不会超过 15 秒。)

但从调试日志来看,Mondrian 似乎并没有尝试这种优化。相反,它似乎是在内部进行连接,并且最终速度特别慢。我在 mondrian.properties 中设置了 mondrian.native.crossjoin.enable=true ,但这似乎不是 Mondrian 能够“原生”的连接类型之一。 (如果我打开 mondrian.native.unsupported.alert=ERROR 那么我会得到相应的异常。)

我想知道是否需要阻止我的用户尝试在如此大的维度/集合上进行连接,或者 Mondrian 是否可能不需要我在这里寻找的工具。但也许我只是做错了什么。

Summary: Most of the examples I've seen of MDX joins have involved joining relatively small sets, say with tens or hundreds of items each. But I find myself also wanting to try joining (in particular "non-empty joining") sets that have thousands or tens of thousands of items each, and it's not working well so far. I'm wondering if this could be made to work, or if I perhaps need to consider using something other than Mondrian/OLAP.

To be concrete, I have a cube that records interactions between Firms (n=7000) and Clients (n=27000). Currently both Firm and Client are completely flat hierarchies; there's the All level and the individual-company level, with no other levels in between. There is a central fact table, and separate dimension tables for Firms and for Clients.

My users at least appear to want to get summary reports along these lines, aggregating all the non-empty interactions between Firms and Clients:

select
  [Measures].[Amount] on columns,
  NonEmptyCrossJoin([Firm].Children,
                      [Client].Children) on rows
from MyCube

But this query and variations on it don't work in my test Mondrian setup. Either I get an OutOfMemoryException (on a 2GB Java heap), or Java seems to spend impossibly long time in mondrian.rolap.RolapResult$AxisMember.mergeTuple(TupleCursor). (I can provide a more complete stack trace if it would help.) By "impossibly long" I mean Java will stay slaving away at the query for hours and hours before I give up.

I initially expected the above query to perform ok, because conceptually it could be done somewhat efficiently by just doing a SQL query along these lines:

select Firm, Client, Sum(Amount) as n
from fact, firm, client
where fact.firmid = firm.firmid and fact.clientid = client.clientid
group by Firm, Client

(In fact, if I execute something like this directly in MySql it doesn't take more than 15sec to execute.)

But from the debug logs Mondrian doesn't seem to attempt this optimization. Instead it appears to be doing the join internally, and in a way that ends up being particularly slow. I've set mondrian.native.crossjoin.enable=true in my mondrian.properties, but this doesn't seem like one of the join types that Mondrian is able to "make native". (If I turn on mondrian.native.unsupported.alert=ERROR then I get the corresponding exception.)

I'm left wondering whether I need to prevent my users from attempting joins on such large dimensions/sets, or whether Mondrian is maybe not the tool I'm looking for here. But maybe I'm just doing something wrong.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

娇女薄笑 2024-12-23 14:27:11

为了跟进,我尝试在 Sql Server Analysis Services (Sql Server 2008) 中设置一个类似的多维数据集,似乎 icCube 对于不同 OLAP 工具的性能不同有一个观点:

即使在我了解 SSAS 最佳实践之前,这方面的性能MDX 类型得到很大改进。按照这些思路进行的查询

select
  [Measures].[Amount] on columns,
  NON EMPTY
  crossjoin([Firms].[Firm Name].Children,
            [Clients].[Client Name].Children)
  on rows
from MyCube

从在 Mondrian 中不可行变为在 Sql Server 中需要大约十秒。可以想象,这与 MS 的商业智能开发工作室默认引导我创建 MOLAP 多维数据集有关,或者 SSAS 可能有更智能的查询规划器。

无论如何,也许这对我来说已经足够快了。如果没有,我还不确定 SSAS 在这种情况下可以得到多少优化。 (一件令人失望的事情是,即使当我第二次重新运行查询时,仍然需要大约 10 秒;我希望缓存可能会产生更显着的效果。)

顺便说一句,您可能会注意到在刚刚引用的 MDX I 中已将我原来的 NonEmptyCrossJoin 替换为与 NON EMPTY 结合的普通交叉连接。这是因为,至少在 Sql Server 世界中,NonEmptyCrossJoin 显然被视为已弃用的不良实践。 (这是在 Microsoft 的 MDX 语言参考中提到的。Mosha,其中一位前 SSAS 开发人员在一篇名为 MDX:非空、存在和邪恶NonEmptyCrossJoin 的简短版本是NonEmptyCrossJoin 语义混乱且应用有限,而且由于Sql Server大约在 2005 年,查询优化器已经足够智能,可以在不使用 NonEmptyCrossJoin 的情况下加快查询速度。)因此,我在上面的 MDX 中替换了更现代、经过批准的等效项。 (它仍然可以与 NonEmptyCrossJoin 一起使用,尽管 NonEmptyCrossJoin 根本不会加快速度。)

To follow up, I tried setting up an analogous cube in Sql Server Analysis Services (Sql Server 2008), and it seems that icCube has a point about different OLAP tools performing differently:

Even before I learned much about SSAS best practices, performance on this type of MDX was much improved. A query along these lines

select
  [Measures].[Amount] on columns,
  NON EMPTY
  crossjoin([Firms].[Firm Name].Children,
            [Clients].[Client Name].Children)
  on rows
from MyCube

went from being non-viable with Mondrian to taking around ten seconds under Sql Server. Conceivably this has to do with MS' Business Intelligence Development Studio guiding me into creating a MOLAP cube by default, or perhaps SSAS has a smarter query planner.

In any case, perhaps this is fast enough for me. If not, I'm not yet sure how much more optimized SSAS can get in this case. (One disappointing thing is that, even when I re-run the query a second time, it still takes about 10 sec; I was hoping caching might have a more dramatic effect.)

Tangentially, you may notice in the just-quoted MDX I've replaced my original NonEmptyCrossJoin with a normal crossjoin combined with NON EMPTY. This is because, at least in the Sql Server world, NonEmptyCrossJoin is apparently regarded as a deprecated bad practice. (This is noted in Microsoft's MDX Language Reference. Mosha, one of the former SSAS devs, describes the situation in an article called MDX: NonEmpty, Exists and evil NonEmptyCrossJoin. Short version is that NonEmptyCrossJoin has confusing semantics and limited application, and since Sql Server 2005 or so, the query optimizer has been smart enough to make your query fast without NonEmptyCrossJoin.) So I've substituted a more modern approved equivalent in the above MDX. (It does still work with NonEmptyCrossJoin as well, though NonEmptyCrossJoin does not speed things up at all.)

淡淡的优雅 2024-12-23 14:27:11

我不是 100% 确定,但是您是否尝试过设置:

mondrian.native.nonempty.enable = true

这种优化似乎将一些类似的操作推到了 sql 级别 - 听起来好像它可以提供帮助。

I'm not 100% sure, but have you tried setting:

mondrian.native.nonempty.enable = true

This optimisation seems to push some operations like that down to the sql level - Sounds like it could help.

东北女汉子 2024-12-23 14:27:11

我来回答OLAP的部分。 OLAP 工具共有三大系列。 ROLAP、MOLAP 和 HOLAP。

ROLAP,Relational,建立在关系数据库之上。如果缓存丢失,MDX 请求将使用 SQL 语句在关系数据库中执行。它们通过延迟具有可扩展性的优势,但取决于它们在底层数据库上的性能。 QoS 可能很棘手,因为它是数据库 QoS。

MOLAP,InMemory,将数据复制到内部结构(内存)中。这里的 QoS、应答时间更加稳定和更快,因为所有处理都在同一台服务器中完成。 MOLAP 的问题是可扩展性,因为您可能会出现内存不足(>100mio)的情况。

HOLAP 是 ROLAP 和 MOLAP 的混合体。我没有直接的经验,但从理论上讲,它们可以带来两全其美的效果。

从这些数字来看,使用 MOLAP 工具您应该不会遇到任何问题,它实际上是一个小立方体。

因此,在离开 OLAP 世界之前,给 MOLAP 服务器一个机会。有关 OLAP 服务器的列表,您可以查看 wikipedia

I'll answer the part of OLAP. There three big families of OLAP tools. ROLAP, MOLAP and HOLAP.

ROLAP, Relational, are build on a relation database. MDX request, if cache is missed, are performed in a relational database using a SQL statement. They have an advantage of scalability, by delagation, but depened for their performance on the underlying data base. QoS might be tricky as it's the db QoS.

MOLAP, InMemory, copy the data into internal structures (memory). Here the QoS, answer times, are more stable and faster as all processing is done in the same server. The issue with MOLAP is scalability as you might get out-of memory (>100mio).

HOLAP those are a mixed of ROLAP and MOLAP. I've no direct experience but in theory they can bring the best of both worlds.

Looking at the numbers you should not get any problem with MOLAP tools, it's really a small cube.

So, before leaving the OLAP world, give a chance to the MOLAP servers. For a list of OLAP servers you can check wikipedia

完美的未来在梦里 2024-12-23 14:27:11

Mondrian OLAP 不支持大型数据库。

嗯,我正在开发位图连接索引OLAP工具(BJIn OLAP),这是一个基于Java开源的OLAP工具。这使用 SQL 差异语法,而不是 MDX。

此处的文档

这里是试用版

Mondrian OLAP dont have suport a large databases.

Well, im developing The Bitmap Join Index OLAP Tool (BJIn OLAP), that is a OLAP tool based in Java open source. This use a SQL variance syntax, not MDX.

Documentation here

Trial version here

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文