Data warehouse with arbitrary fields
In our application, we support user-written plugins.
Those plugins generate data of various types (int, float, str, or datetime), and those data are labeled with a bunch of metadata (user, current directory, etc.) as well as three free-text fields (MetricName, Var1, Var2).
Now we have several years of this data, and I'm trying to design a schema which allows very fast access to those metrics in an analytical fashion (charts and stuff). This is easy as long as there are only a few metrics we're interested in, but we have a large number of different metrics at different granularities, and we'd like to store user-added data to allow for later analysis (possibly after a schema change).
Example data: (please keep in mind this is very simplified)
=====================================================================================================
| BaseDir         | User   | TrialNo | Project | ... | MetricValue | MetricName | Var1  | Var2      |
=====================================================================================================
| /path/to/me     | me     | 0       | domino  | ... | 20          | Errors     | core  | dumb      |
| /path/to/me     | me     | 0       | domino  | ... | 98.6        | Tempuratur | body  |           |
| /some/other/pwd | oneguy | 223     | farq    | ... | 443         | ManMonths  | waste | Mythical  |
| /some/other/pwd | oneguy | 224     | farq    | ... | 0           | Albedo     | nose  | PolarBear |
| /path/to/me     | me     | 0       | domino  | ... | 70.2        | Tempuratur | room  |           |
| /path/to/me2    | me     | 2       | domino  | ... | 2020        | Errors     | misc  | filtered  |
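To make the layout concrete, here is a minimal sketch of a staging table for these raw rows (the names and types are illustrative only, and the metadata columns hidden behind "..." are omitted):

    -- Hypothetical staging table mirroring the sample rows above.
    -- The metadata columns hidden behind "..." are not shown.
    CREATE TABLE plugin_metric_staging (
        base_dir     VARCHAR(512) NOT NULL,  -- BaseDir
        username     VARCHAR(64)  NOT NULL,  -- User
        trial_no     INTEGER      NOT NULL,  -- TrialNo
        project      VARCHAR(64)  NOT NULL,  -- Project
        metric_value VARCHAR(64)  NOT NULL,  -- raw text; int/float/str/datetime decided later
        metric_name  VARCHAR(128) NOT NULL,  -- free text
        var1         VARCHAR(128),           -- free text
        var2         VARCHAR(128)            -- free text
    );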
Anyone can add a parser plugin to start measuring an AirSpeed metric, and we'd like our analysis tools to "just work" on that new metric.
Update:
Considering that many of the MetricNames are well-known beforehand, I can satisfy my requirements if I can enable analysis on those metrics and simply store the other user-added metrics. We can accept the fact that new metrics won't be available for heavy-duty analysis without an edit to the schema.
What do you guys think of this solution?
I've divided our metrics into three fact tables, one for facts that don't need a MetricTopic, one for ones that do, and one for all the other metrics, including unexpected ones.
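Roughly, the split looks something like this (a minimal sketch with invented names, not the actual DDL; the *_key columns point at ordinary dimension tables for date, user, project, metric, and topic, which are not shown):

    -- 1. Facts that don't need a MetricTopic.
    CREATE TABLE fact_simple_metric (
        date_key     INTEGER NOT NULL,
        user_key     INTEGER NOT NULL,
        project_key  INTEGER NOT NULL,
        metric_key   INTEGER NOT NULL,
        metric_value DOUBLE PRECISION NOT NULL
    );

    -- 2. Facts that do need a MetricTopic (e.g. Tempuratur of "body" vs "room").
    CREATE TABLE fact_topic_metric (
        date_key     INTEGER NOT NULL,
        user_key     INTEGER NOT NULL,
        project_key  INTEGER NOT NULL,
        metric_key   INTEGER NOT NULL,
        topic_key    INTEGER NOT NULL,
        metric_value DOUBLE PRECISION NOT NULL
    );

    -- 3. Everything else, including metrics nobody anticipated, stored
    --    generically and not expected to support heavy-duty analysis.
    CREATE TABLE fact_other_metric (
        date_key     INTEGER NOT NULL,
        user_key     INTEGER NOT NULL,
        project_key  INTEGER NOT NULL,
        metric_name  VARCHAR(128) NOT NULL,
        var1         VARCHAR(128),
        var2         VARCHAR(128),
        metric_value VARCHAR(64) NOT NULL   -- raw text, typed on read
    );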
For the bounty:
I'll accept any critique which shows how to make this system more functional, or brings it into closer alignment with industry best practices. References to literature give added weight.
Comments (2)
If I understand correctly, you are looking for a schema to support on-the-fly creation of measures in a DW. In a classical data warehouse each measure is a column, so in a Kimball star you would need to add a column for each new measure -- change the schema.
What you have is an EAV model, and analytics on EAV is not easy and not fast -- take a look at this discussion.
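To see why, consider what even a simple comparison chart requires: each metric has to be pivoted from rows back into columns, typically with one conditional aggregate per metric. A sketch against the flat layout from the question (names follow the sample data and the staging-table sketch above):

    -- Pulling two metrics out of the flat/EAV layout for a chart.
    -- Each additional metric needs another conditional aggregate (or a
    -- self-join), which is why analytics over EAV data gets slow and awkward.
    SELECT
        project,
        trial_no,
        MAX(CASE WHEN metric_name = 'Errors'
                 THEN CAST(metric_value AS DOUBLE PRECISION) END) AS errors,
        MAX(CASE WHEN metric_name = 'Tempuratur'
                 THEN CAST(metric_value AS DOUBLE PRECISION) END) AS temperature
    FROM plugin_metric_staging
    GROUP BY project, trial_no;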
I would suggest you look at tools like Splunk, which are suited to this type of problem.
You don't have that many facts. There aren't that many units.
Facts have units. Seconds, pounds, bytes, dollars.
You need to review the "Star Schema" design. You have dimensions (probably a lot) and measurable facts (probably very few).
You join the facts to all of the associated dimensions. You can sum and count the facts, and group by the dimensions.
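For example, a typical query against a single fact table with conformed dimensions is just joins, a filter, and a group-by (a minimal sketch; the table and column names are assumed, not taken from your schema):

    -- Sum one measurable fact, sliced by attributes from two dimensions.
    SELECT
        d.calendar_year,
        p.project_name,
        SUM(f.metric_value) AS total_errors
    FROM fact_metric  f
    JOIN dim_date     d ON d.date_key    = f.date_key
    JOIN dim_project  p ON p.project_key = f.project_key
    JOIN dim_metric   m ON m.metric_key  = f.metric_key
    WHERE m.metric_name = 'Errors'
    GROUP BY d.calendar_year, p.project_name;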
You can't have thousands of independent facts. That's almost impossible. But you can have thousands of combinations of dimensions, that's common.
Separate facts (measurable quantities that add pleasantly) from dimensions (definitional qualities) and you should have a lot of dimensions around a few facts.
Buy a copy of Kimball (The Data Warehouse Toolkit).