OLAP/MDX 是处理各种聚合级别的未知值数据的好方法吗
我是 OLAP 新手,所以也许我不知道解决这个问题的正确术语,但请耐心听我说。
我处理大量分层、多维数据,其中父/聚合单元格大多具有数据,但子/叶单元格通常缺少数据(属性值未知但非零)。我目前使用脚本和 SQL 的组合来使用它,但这变得很笨拙。看起来 OLAP 多维数据集和 MDX 更适合数据的结构,但不一定适合我需要处理的任务。例如:
- OLAP 似乎主要是为只读报表而设计的;我对批处理中的数据做了很多修改
- OLAP 似乎喜欢拥有完整的叶级数据来计算聚合;我的数据在各个级别上都有缺失值
我想要做的示例:
- 将原始多级数据加载到多维数据集中并保留已知的父级;不要覆盖或将其值显示为子项的计算聚合(这可能不完整)。
- 根据其他多维数据集的复杂查询/联接的结果创建/更新/删除多维数据集中的单元格。有时需要转换多维数据集以使用稍微不同的维度定义。
- 用户需要对未知值进行估计。我可以创建不错的估计,但需要调整它们,以便它们在所有维度和水平上都符合已知的父母/孩子(这比听起来要困难得多)。我已经在这样做了,但它涉及将数据从 RDBMS 提取到自定义可执行文件中。
- 查询和计算需要能够正确处理未知数。理想情况下,能够轻松查询聚合单元格的值有多少是由估计值与已知值组成,可能计算置信度/错误统计数据,或者检查当未知数具有已知的父级和所有值时,我们是否可以得出该未知数的精确值。已知的兄弟姐妹等。
- 数据可能很大...高达数千万事实表行。对于批处理作业来说,性能需要足够好(几分钟还可以,几个小时就不行了)。
OLAP 服务器和 MDX 可以成为此类工作的好工具吗?是否有其他工具可以很好地处理分层/多维/填充数据?
I'm new to OLAP, so perhaps I don't know the right terminology to use for this question, but bear with me here.
I work with lots of hierarchical, multidimensional data where parent/aggregated cells mostly have data, but child/leaf cells are often missing data (attribute values are unknown but non-zero). I currently use a combination of scripting and SQL to work with it, but that's getting unwieldy. It seems like OLAP cubes and MDX are better suited to the structure of the data, but not necessarily to tasks I need to do with it. For example:
- OLAP seems mainly designed for read-only reporting; I do a lot of modifications to the data in batch processes
- OLAP seems to like having complete leaf-level data to calculate aggregates; my data has missing values at various levels
Examples of what I want to do:
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
- Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
- Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
- Queries and calculations need to be able to handle the unknowns properly. Ideally be able to easily query how much of an aggregated cell's value is made up of estimated vs. known values, possibly compute confidence/error statistics, or check whether we can derive an exact value for an unknown when it has a known parent and all known siblings, etc.
- Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
Could an OLAP server and MDX be a good tool for this type of work? Are there any other tools that would work well for manipulating hierarchical/multidimensional/gap-filled data?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
这是 OLAP 系统的一些需求,有趣且具有挑战性 :-) :
- 将原始多级数据加载到多维数据集中并保留已知的父级;不要将其值覆盖或显示为子级的计算聚合(这可能不完整)。
您可以更改多维数据集在层次结构中聚合值的方式。在一个层次结构中执行此操作很好,在多个层次结构中执行此操作可能会开始变得复杂。如果存在多个“特殊”层次结构问题的数学“唯一”解决方案,则值得检查两次。
根据其他多维数据集的复杂查询/联接的结果创建/更新/删除多维数据集中的单元格。有时需要转换多维数据集以使用稍微不同的维度定义。
在这里您可以使用写回(MDX 函数更新立方体),但我认为这对于您的需求来说有点太简单了。实施取决于供应商。请注意,创建单元格会破坏您的记忆,因为对于大立方体,您可以在子立方体中快速拥有数百万个单元格。
你的模型的稀疏度是多少? ->包含数据的单元格数量/总单元格数量
某些模型的稀疏度为 1e-30,如果您更新所有单元格,则很容易爆炸;-)。
用户需要对未知值进行估计。我可以创建不错的估计,但需要调整它们,以便它们在所有维度和水平上都符合已知的父母/孩子(这比听起来要困难得多)。我已经在这样做了,但它涉及将数据从 RDBMS 提取到自定义可执行文件中。
这看起来很复杂。这里的问题是算法的复杂性、使用 MDX 语言的可能解决方案以及它们如何与OLAP engige(足够快)。您冒着它爆炸的风险,但请查看作用域函数
数据可能很大...高达数千万事实表行。对于批处理作业来说,性能需要足够好(分钟还可以,小时不太多)。
这不应该是一个真正的挑战。
要回答你的问题,我不这么认为。我们在遗传领域也有类似的问题,我们将通过向 OLAP 解决方案“添加”专用计算模块来解决该问题。这是一个有趣的正在进行的项目
That's some needs for an OLAP system, interesting and challenging :-) :
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
You can change the way cubes aggregate values in a hierarchy. Doing this in one hierarchy is fine doing this using in multiple hierarchies might start to get complicated. It's worth checking twice if there is a mathematical 'unique' solution to the problem with multiple 'special' hierarchies.
Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
Here you can use writeback (MDX function Update cube), but I think it's a bit too simple for your needs. Implementation depend on the vendors. Pay attention creating cells can kill your memory as for large cubes you can quickly have millions of cells in a subcube.
What is the sparsity of your model ? -> number of cells with data / number of total cells
Some models have sparsities of 1e-30, here it's easy to explode if you're updating all cells ;-).
Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
This is looking complicated The issue here is the complexity of the algos, a possible solution using MDX language and how they match with the OLAP engige (fast enough). You're taking the risk it explodes, but have a look at Scope function
Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
That should not be a real challenge..
To answer your question, I don't think so. We've a similar problem - on the genetical field - and we are going to solve the problem 'adding' a dedicated calculation module to our OLAP solution. It's an interesting on going project