调查数据模型 - 如何避免 EAV 和过度非规范化?

发布于 2024-10-10 12:54:20 字数 1838 浏览 0 评论 0原文

我的数据库技能充其量只是平庸,我必须为调查数据设计一个数据模型。我对此进行了一些思考,现在我觉得我陷入了某种 EAV 模型以及涉及数百个表的设计,每个表都有数百个列(和数千条记录)。一定有更好的方法来做到这一点,我希望论坛上的智者能够帮助我。

我的问题是:我应该如何在 RDBMS 中对调查问题的答案进行建模?必须使用 SQL Server。因此,替代数据存储系统应排除在本次讨论之外。 (当然,有些应该并且将会被评估,但请不要在这里。)我不需要整个数据模型的解决方案,目前我只对答案部分感兴趣。

我已经搜索过各种论坛,但我找不到真正的解决方案。如果它已经在其他地方给出,请原谅并给我提供一个链接,以便我可以阅读。

关于我必须处理的数据的一些假设:

  1. 每项调查由 1 到 n 份调查问卷组成
  2. 每份调查问卷由 100-2,000 个问题组成(请忽略 2,000 个问题确实听起来需要回答很多...)
  3. 问题可以有多种类型:多项选择、自由文本、数字(如年龄、收入、百分比...)
  4. 每个调查涉及 10-200 个国家(这些不是受访者。受访者实际上是这些国家/地区的人民。)
  5. 取决于类型每份调查问卷由每个国家 100-20,000 名受访者回答。
  6. 一个国家可以调整调查问卷,即添加、删除或编辑问题。
  7. 一个国家的数据收集在该国家的单独数据库中。从一开始就不可能进行在线集成。
  8. 所有国家的数据必须稍后整合。这意味着,例如,如果一个国家删除了一个问题,则该数据必须以某种方式从他们发送的内容中导出,以便在所有国家/地区实现统一的设计,
  9. 我将不得不编写集成和清理软件,该软件需要与每个国家的数据
    1. 最后,需要将数据导出为平面文件、每个国家一个矩形网格和调查问卷。

我已经和不同背景的人讨论过这个话题,但还没有得出一个好的解决方案。我主要得到两种意见。

  1. 习惯于使用平面文件(电子表格样式)进行数据处理和分析的领域专家投票支持具有大量表和列的非规范化结构,如上所述(每个国家和调查问卷 1 个表)。这对我来说听起来很糟糕,因为我了解到要避免使用宽表,在使用表时确定表中实际包含哪些列会很烦人,数据库将变得混乱,有数百个表(或者我什至需要建立多个数据库,每个数据库都有相似但有点不同的设计)等。
  2. 面向对象程序员投票支持强烈的“标准化”设计,这将有效地产生一个包含所有受访者对所有问题的所有答案的中心表。该表要么需要包含 sql_variant 类型的列,要么需要包含不同类型的多个答案列来存储不同类型的答案(多项选择、自由文本等)。前者本质上是 EAV 模型。我倾向于遵循 Joe Celko 的观点,他强烈反对使用它(他称之为 OTLT 或“一个真正的查找表”)。后者意味着每行将包含设计上不适用的类型的空单元格。

我能想到的另一种选择是为每种答案类型创建一个表,即一个用于多项选择问题,一个用于自由文本问题等。这不是那么通用,它会导致大量的联合连接,我我认为如果发明了新的答案类型,我将不得不添加一个表格。

很抱歉让您厌倦了这些文字,感谢您的输入!

干杯, 亚历

克斯 PS:我在这里问了同样的问题:http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx

My database skills are mediocre at best and I have to design a data model for survey data. I have spent some thoughts on this and right now I feel that I am stuck between some kind of EAV model and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this and I hope that the wise folks on this forum can help me.

My question is: how should I model the answers to survey questions in an RDBMS? Using SQL Server is mandatory. So alternative data storage systems should be excluded from this discussion. (Sure, some should and will be evaluated, but not here please.) I don't need a solution for the entire data model, for now I'm only interested in the Answers part.

I have already searched various forums, but I couldn't really find a solution. If it has already been given elsewhere, please excuse me and provide me with a link so I can read it up.

Some assumptions about the data I have to deal with:

  1. Each survey consists of 1 to n questionnaires
  2. Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sound like a lot to answer...)
  3. Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
  4. Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
  5. Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
  6. A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
  7. The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
  8. The data for all countries has to be integrated later. This means for example, if a country has deleted a question, that data must somehow be derived from what they sent in order to achieve a uniform design across all countries
  9. I will have to write the integration and cleaning software, which will need to work with every country's data
    1. In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.

I have already discussed this topic with people from various backgrounds and have not come to a good solution yet. I mainly got two kinds of opinions.

  1. The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis vote for a denormalized structure with loads of tables and columns as I described above (1 table per country and questionnaire). This sounds terrible to me, because I learned that wide tables are to be avoided, it will be annoying to determine which columns are actually in a table when working with it, the database will become cluttered with hundreds of tables (or I even need to set up multiple databases, each with a similar yet a bit differetn design), etc.
  2. O-O-programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would either need to contain a column of type sql_variant type or multiple answer columns with different types to store answers of different types (multiple choice, free text, ..). The former would essentially be a EAV model. I tend to follow Joe Celko here, who strongly discourages its use (he calls it OTLT or "One True Lookup Table"). The latter would imply that each row would contain null cells for the not applicable types by design.

Another alternative I could think of would be to create one table per answer type, i.e., one for multiple-choice questions, one for free text questions, etc.. That's not so generic, it would lead to a lot of union joins, I think and I would have to add a table if a new answer type is invented.

Sorry for boring you with all this text and thank you for your input!

Cheers,
Alex

PS: I asked the same question here: http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

夜还是长夜 2024-10-17 12:54:20

alt text嗯,imgur 已关闭,所以我稍后会发布图片。

我认为这在关系模型中是完全可行的。我构建了一个 CDM 来展示我将如何做到这一点。

出站

需要 4 个实体来定义一个国家/地区的调查。一些家长调查、国家和问题清单。您的问题具有内部关系,因此当一个国家/地区“编辑”某个问题时,您可以跟踪该国家/地区提出的问题及其来源。您需要的另一件事是一个可能的答案实体/表。每个问题可能有一个相关的可能答案列表(多项选择或范围等)。这 4 个应该完全定义“出站”方面。

入站

“入站”方只是 2 个新实体,即被投诉人和答案。受访者很简单,只是该人的人口统计数据(如果您认识他们),并且在这里您可以包括与回国的关系。每个受访者都在特定国家/地区回答了调查。 (如果此人旅行或具有双重国籍,则该人可能与受访者是 1:n)

答案是基本的;它要么是可能答案列表中列出的选项之一,要么是已提供的。不要全神贯注于答案可能是数字、日期等的事实。要么是 FK,要么是字符串。

报告

报告是所有这些内容的结合...您将选择一个国家/地区和一项调查,获取问题和答案列表。

答案复杂性

取决于您想要在哪里进行计算。如果您使用 Varchar2(4000) 列作为用户提供的答案,则可以向问题添加属性来描述答案的数据类型。问:年龄? DT:(0 到 130)之间的整数。然后,您的集成层可以进行验证,而不是由数据库强制执行。或者您可以有 4 列,其中一列用于数字、日期、字符和 CLOB。您的集成层将决定要使用的列。当您报告这些答案时,您只需使用 Coalesce() 选择所有四列。

这是 EAV 吗,因为“Answer”的数据类型有点模糊

不,不是。

EAV 模型将实体分解为属性列表。
像这样:

Entity     Attribute     Value
  1          Fname         Stephanie
  1          Lname         Page
  1          Age           30

因为您看到调查模式的“答案”列同时包含单词和数字,就像此处的“值”列一样,您认为它定义了 EAV。事实并非如此。就像我向该模型添加了 3 个数据类型列一样,它不会从 EAV 中更改它。

我非常讨厌它

有人告诉我,我正在调整的查询必须“尽可能快”。好吧,给我 10 亿美元和 30 年。 “等等,十亿什么?” “尽可能多”、“尽可能快”都不是要求。您可以验证数据库中您想要的任何内容...构建 Before 触发器的负载,瞧!大量验证。

年龄列的数据类型是什么?或者生日栏?取决于您的数据源是什么。一些较旧的记录可能只有月份和年份,或者只有年份,或者“大约”或“大约”某年。您不能只有一个数字列并进行“尽可能多的验证”。 NUMBER(2) 可能比 NUMBER 更好。所以现在你将有 NUMBER(1)、NUMBER(2)、NUMBER... 来表示“尽可能多”。

我认为你被绊倒的地方

将其视为概念数据模型,而不是物理模型。在这些术语中,调查是一个实体。 问题是调查的一个实体还是只是一个属性。如果您构建了每个表,那么您清楚地表明问题只是调查的一个属性,并且垂直存储它们使其成为 EAV。这个模型表明 Question 实际上是另一个实体。问题之间存在某种关系,例如“一个国家[可以]编辑问题”。有最初的问题和编辑过的问题。每个问题都有一个可能答案的集合。最重要的是,它们都是问题。在 EAV 中,我调用 fname、lname、bdate、age、major、salary 等……所有这些都是非常不同的东西,只是属性。在这种情况下,我们不会将发起调查的机构名称、发布日期、到期日期等作为问题。

让我换一种说法。你是联邦快递。您想要存储某些事件的时间戳。每次包裹进入或离开设施或车辆时。在提货卡车上的时间、从卡车上下来并进入第一个设施的时间、离开该设施并登上飞机的时间等。您是否水平存储它们?如何提前知道跳数?如果您垂直存储它们,是否会自动使其成为 EAV?如果是这样为什么。

您是一家气象公司,从全国各地的气象站获取临时数据。假设传感器设计为在温度变化+/-整度时发送读数。如果您存储的sensor_ID|timestamp|temp是一个读取表,那么它是一个EAV吗?每个读数不是传感器的属性,它们本身就是属于集合/系列的实体。

答案的垂直存储与 EAV 的共同点之一是其难以执行分析查询。如果您想要列出所有对问题 5 和 10 回答为 TRUE 但对问题 6 和 11 回答为 FALSE 的人,那么垂直完成时会非常困难。也许这就是您将其视为 EAV 的原因。如果您想这样做,则需要不同的存储。问题和答案的关系存储并不是最好的报告数据库。让我们回到联邦快递的例子。当行是垂直时,进行“传输”时间报告并不简单。

alt textWell imgur is down so i'll post the pic later.

I think this is completely feasible within a relational model. I've built a CDM to show how I would do this.

Outbound

It takes 4 entities to define a Country's Survey. Some Parent Survey, the country and a list of questions. Your questions have an internal relationship so when one country "edits" a question, you can track both the question asked by the country and the question it came from. The other thing you need is a Possible Answer entity/table. Each question may have an associated list of possible answers (multiple choice or ranges etc). Those 4 should completely define the "OUTBOUND" side of this.

Inbound

The "INBOUND" side is just 2 new entities, The Respondent and the answer. The respondent is straightforward, just the demographics of that person if you know them and here you can include a relationship back to country. Each respondent answered the survey in a given country. (Person may be 1:n with Respondent if the person travels or has dual citizenship)

The answer is basic; either it is one of the choices listed in the list of Possible Answers or it is provided. Don't get all caught up in the fact that the answer may be a number, date, etc just yet. Either it's a FK or a string of characters.

Reporting

A report is a join over all of these... You'll choose a country and a survey, get the list of questions and answers.

Answer Complexity

Depends on where you want to do your calculations. If you used a Varchar2(4000) column for your user-provided answers, you could add an attribute to question to describe the datatype of the answer. Q: Age? DT: Integer Between (0 and 130). Then your integration layer can do the validation instead of the database enforcing it. Or you can have 4 columns, one for number, date, character and CLOB. And your integration layer will determine the column to use. When you report those answers out, you'll just select all four columns with Coalesce().

Is this an EAV because there's a slight ambiguity to the datatype of "Answer"

No, it's not.

AN EAV model breaks down an Entity into a list of attributes.
like so:

Entity     Attribute     Value
  1          Fname         Stephanie
  1          Lname         Page
  1          Age           30

because you see the Answer column of the Survey schema is holding both words and numbers like the Value column does here you think that defines EAV. It does not. Just as if I added 3 datatype columns to this model it wouldn't change it FROM an EAV.

I soooo hate it when

I've had people tell me that the query I'm tuning has to go "as fast as possible". Ok, so give me a billion dollars and 30 years. "Wait, a Billion what?" "As much as", "as fast as" aren't requirements. You can validate anything you want in a database... build a shedload of Before triggers, voila! Validation galore.

What's the datatype of an age column? Or Birthdate column? Depends on what your data source is. Some older records may only have Month and Year, or just year, or 'around' or 'circa' some year. You couldn't have just a number column and do 'as much validation as possible'. and NUMBER(2) may be BETTER validation than just NUMBER. So now you'll have NUMBER(1), NUMBER(2), NUMBER... to have "as much as".

Where I think you are getting tripped up

Think of this as a Conceptual Data Model, not a Physical one. In those terms Survey is an entity. Is Question an entity or just an attribute of Survey. If you built One table PER you're clearly saying that Question is just an Attribute of Survey and storing them vertically makes this an EAV. What this model shows is that Question is actually another entity. There is a relationship between Questions, e.g. 'a country [can] edit questions'. There was the original question and edited one. Each question has a collection of possible answers. And the most important this is that, they are all questions. In an EAV I call fname, lname, bdate, age, major, salary, etc... all very disparate things, just attributes. In this case we're not including the name of the agency who originated the survey and the date it was issued and the date is due back and the etc... as questions.

Let me put this another way. You're Fedex. You want to store timestamps for certain events. Each time a package enters or leaves a facility or vehicle. Time on the picking up truck, time off the truck and into the first facility, time out of that facility and onto a plane, etc. Do you store them Horizontally? How do you know the number of hops in advance? If you store them vertically does that automatically make it an EAV? And if so why.

You're a weather company getting temps from stations around the country. Let's say the sensors are designed to send a reading when the temperature changes +/- a full degree. If you store a sensor_ID|timestamp|temp is a Reading Table is that an EAV? Each reading isn't an attribute of the sensor, they are themselves entities which belong to a collection/series.

One thing that vertical storage of answers has in common with an EAV is its difficulty in performing analytic queries. If you wanted a list of all the people who answered TRUE to question 5 and 10 but FALSE to 6 and 11 would be very difficult when done vertically. Maybe that's why you see this an EAV. If you want to do that, you need a different storage. The relational storage of the question and answers isn't the best reporting database. Let's go back to the Fedex example. It's not simple to do "transit" time reporting when the rows are vertical.

等待圉鍢 2024-10-17 12:54:20

这听起来像是您正在解决一个常见问题:如何使用锤子来拧紧螺丝。

您列出的两种选择都不好,每种选择都有不同的原因。但这是因为您试图将特定的数据模型填充到关系数据库系统中。一个好的方法是超越关系数据库,查看一些其他数据库/存储系统,尝试结合起来,找到最适合您的项目的方案。


我尝试过 EAV 模型并放弃了,因为它太复杂了,而且我不敢尝试使用关系数据库系统的多表模型。我发现的关系数据库最简单的解决方案是:将每个完整的响应存储为单个 CLOB,序列化为 JSON 或 YAML(或其他轻量级的内容),并存储在响应表中。

create table responses (
  id uuid primary key,
  questionnaire_id uuid references questionnaires.id,
  data text
)

This sounds like you are wrestling with a common problem: how to use a hammer to fasten a screw.

Both alternatives you listed are bad, each for different reasons. But that's because you are trying to stuff your particular data model into a relational database system. A good approach would be to look beyond the relational database at some other database/storage systems, try a couple out, and find the best fit for your project.


I have tried the EAV model and gave up because it was far too complex, and I am afraid to try the multi-tables model with a relational database system. The easiest solution I have found with a relational database is: store each complete response as a single CLOB, serialized into JSON or YAML (or something else lightweight), in a responses table.

create table responses (
  id uuid primary key,
  questionnaire_id uuid references questionnaires.id,
  data text
)
沐歌 2024-10-17 12:54:20

如果我使用 SQL Server,Express 就可以了,那么我会这样做:

  • 带有问题列表、标志的表
    对于类型(位),如果需要标志
    (位),正确答案(如果存在),
    等等
  • 包含国家列表的表格
  • 国家和地区的链接表格
    问题(某些国家可能无法得到某些问题的
  • 答案表,其中包含以下列:
    问题和 xml
    可选问题栏
    包括那些添加的内容。

如果您不熟悉分解 XML,则对所有可选问题使用稀疏列。我不太记得表中稀疏列的数量限制,但我相信它超过 30,000。 SQL Server 在内部将稀疏列存储为 XML,并且当选择该列并且可以对其进行索引时将其粉碎。

下图显示了使用 SQL Server 创建的图表。 AL_A4 列将保存 QL_Id = 4 的答案,并且类型为稀疏。 QuestionList 表中的 QL_Id 未标记为必需,让您知道使 AnswerList 中的列稀疏。

由于国家/地区将添加问题,因此会创建 QuestionListCustom、QuestiontoCountryCustom 和 AnswerListCustom 表并添加自定义问题中的信息。

我确信还有其他方法来设计存储,这就是我交作业的方式,如果这不是作业那么你肯定为联合国工作。

替代文字

If I was using SQL Server, Express will be OK, then I would do this:

  • Table with list of questions, flags
    for type (bit), if required flag
    (bit), the correct answer if exists,
    etc
  • Table with list of countries
  • Table linking of countries and
    questions (some countries may not get some questions
  • Table for answers with columns for
    the question(s) and a xml
    column for the optional questions
    including those which are added

If you are not versed in shredding XML then use sparse columns for all the optional questions. I do not recall exactly the limit on the number of sparse columns in a table but I believe it is above 30,000. SQL Server internally stores sparse columns as XML and will shred it when one selects the column and yes it can be indexed

The diagram below show a diagram created with SQL Server. the column AL_A4 will hold the answer to QL_Id = 4 and is of type sparse. The QL_Id in the QuestionList table is not flagged required letting you know to make the column in AnswerList sparse.

Since countries will add questions create QuestionListCustom, QuestiontoCountryCustom and AnswerListCustom tables and add the information from the custom questions.

I am sure there are other ways to design the storage, this is the way I would turn in the homework, if this is not homework then you surely work for the UN.

alt text

弃爱 2024-10-17 12:54:20

您是否考虑过不重新发明轮子?已经构建了开源调查应用程序。即使它们不能满足您的需求,也可以下载一些并检查它们的数据模型。

Have you considered not reinventing the wheel? There are open source survey applications already built. Even if they don't meet your needs, download a few and check out their data models.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文