名称值对和事实表
我正在研究用于分析发布的表单数据的星型模式。 表单数据将发布到的站点实际上是托管表单的站点的外部,因此只有表单中的数据可用。 我将提供包含一些额外有用信息的选项,包括隐藏字段、原始引荐来源网址、会话 ID 等。
我将能够使用正则表达式来匹配某些数据类型并将它们提取到特定维度,例如邮政编码。
我有一个解决方案来处理尺寸的任意性,它不是一个很好的方案,但它会起作用。
我遇到的问题是,我不知道我的事实表中会包含什么,它不像我可以聚合的很好的数值。 除了“是的,有一个表格帖子”满足这些标准这一事实之外。
我想知道我是否以正确的方式处理这个问题? 我是否使用了错误的工具来完成这项工作? 或者我只是错过了什么?
西蒙.
进一步的细节:
有两个功能区域,根据标准过滤表单帖子,例如在两个时间戳之间。 但在过滤方面几乎所有东西都可以争夺。 然后,所选的表单帖子将用于生成用于导出的 csv 文件。
另一个主要领域是分析,研究广告支出向客户线索的转化是一个明显的起点。 也有些开放式,取决于表单数据。
I'm working on a star schema for analysis of posted form data. The site that the form data will be posted to is actually external to the site hosting the form, so only the data in the form will be available. I'm going to give the option to include some extra useful information with hidden fields, original referrer, session id etc.
I'll be able to use regular expressions to match certain data types and pull them out to specific dimensions e.g. Postcodes.
I have a solution to dealing with the arbitrary nature of the dimensions, its not a great one but it will work.
The problem that I have is that I have no idea what is going to be in my fact table, its not like there is a nice numerical value that I can aggregate. Apart from the fact that "yes there is a form post" that satisfies these criteria.
I'm wondering if I'm approaching this in the right way? Am I using the wrong tool for the job? Or am I just missing something?
Simon.
Further detail:
There are two areas of functionality, filtering the form posts dependant on criteria e.g. between two timestamps. But pretty much anything is up for grabs in terms of filtering. The selected form posts will then be used to generate a csv file for export.
The other main area is analytics, studying the conversion of ad spend into customer leads is an obvious starting point. Also somewhat open ended and depends on the form data.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您没有设计星型模式。 您正在设计一个 Entity-Attribute-Value 表,其中包含所有您正在识别的问题。
如果您确实不知道数据是什么样子,即存在哪些表单字段以及每个表单字段应使用什么数据类型,那么关系数据库并不是保存信息的正确工具。 尝试 XML、YAML 或 JSON。 这些是结构化但动态的格式。 您可以即时建立元数据。 您可以将整个表单实例存储在文件中或数据库中的 BLOB 中。
另一种可以管理动态元数据的新兴技术是 RDF,其查询语言为 SPARQL。 Sesame 是语义数据引擎的一个示例。
You aren't designing a star schema. You're designing an Entity-Attribute-Value table, which has all the problems you're identifying.
If you really have no idea what your data will look like, i.e. what form fields exist and what data types should be used for each one, then a relational database is not the right tool to persist the information. Try XML or YAML or JSON. Those are structured, but dynamic, formats. You can establish metadata on the fly. You can store the whole form instance in a file or in a BLOB in your database.
Another emerging technology that can manage dynamic metadata is RDF, with the query language SPARQL. Sesame is an example of a semantic data engine.
拥有没有测量值的事实表是可以的——它们只是被称为“无事实的事实表”。 但您通常仍然会在其中放置一个 row_count 列 - 尽管它的值始终为 1 - 以轻松添加汇总表。 您最终可能会在稍后添加其他测量值 - 例如,对术语情绪的测量。
我不会太担心这看起来不像仓储 101 示例 - 有很多极端情况会发生奇怪的事情。 你当然可以有 field_name & field_value 作为列,如果没有 field_name,甚至只是 field_value。 这样可行。 它提供了大量的灵活性。
但您错过了一些重要的功能。 由于给定的项目或对象实际上分为多行 - 典型的 SQL 过滤效果不佳。 您通常需要将所有行拉入一个小应用程序中,该应用程序可以将它们作为一个整体进行评估 - 或者编写一些非常复杂的多步骤 sql,其中将每行评估的布尔结果插入到临时表中,然后按 session_id 分组(或无论等值),然后最终评估和/或逻辑。
另一种选择是走这条路,但逐渐开发 ETL 解析功能,以便随着时间的推移,您可以将其中一些内容拉出到更传统的维度。 也许这会成为您的暂存表或原始表,但您尝试让大多数报告符合更传统的星型模式。
最后一个选项 - 考虑非关系数据库。 更面向文档的东西可能会为您提供更好的功能。
It's ok to have fact tables with no measurements - they're just called "factless fact tables". But you still typically put a row_count column in there - even though it'll always have a value of one - to easily add summary tables. And you may end up adding other measurements later - like a measurement of the sentiment of the term for example.
And I wouldn't get too worried that this doesn't look like a warehousing 101 example - there are a lot of corner cases where odd things happen. You can certainly have field_name & field_value as columns, or even just field_value if you don't have field_name. That works. And it provides a ton of flexibility.
But you are missing out on some important functionality. Since a given item or object is really split across multiple rows - typical sql filtering won't work well. You typically need to pull all the rows into a little app that can evaluate them as a whole - or write some very complex multi-step sql where you insert the boolean results of each row evaluation to a temp table, then group by session_id (or whatever the equiv), then finally evaluate for and/or logic.
Another option - is to go this route, but gradually develop your ETL parsing functionality so that over time you can pull some of this stuff out into more traditional dimensions. Perhaps this becomes your staging or raw table, but you try to have most reports hit your more traditional star schema.
Last option - consider a non-relational database. Something more document-oriented may provide better functionality for you.