淹没在空无的海洋中
我继承的一个应用程序跟踪对材料样品执行的实验室测试结果。数据存储在单个表 (tblSampleData) 中,其主键为 SampleID,并有 235 列代表潜在的测试结果。问题是每个样本仅执行少量测试,因此每行包含超过 200 个空值。实际上,还有第二个类似的表 (tblSampleData2),其中包含另外 215 个主要为空的列和 SampleID 的主键。这两个表具有一对一的关系,并且大多数 SampleID 在这两个表中都有一些数据。 但是,对于每个 SampleID,很容易就有 400 个空列!
这是糟糕的数据库设计吗?如果是这样,哪个范式规则被打破了?如何查询此表以确定哪些列组通常与数据一起填充?我的目标是拥有 45 个表,其中包含 10 列和更少的空值。我该怎么做?如何避免破坏现有应用程序?
到目前为止,这些表大约有 200,000 条样本记录。用户要求我添加更多列以进行更多测试,但我宁愿构建一个新表。这明智吗?
An application I inherited tracks lab test results performed on material samples. Data is stored in a single table (tblSampleData) with a primary key of SampleID and 235 columns representing potential test results. The problem is that only a few tests are performed per sample, so each row contains over 200 nulls. Actually, there is a second similar table (tblSampleData2) with another 215 primarily null columns and a primary key of SampleID. The two tables have a one-to-one relationship and most SampleIDs have some data in both tables. For every SampleID, however, there are easily 400 null columns!
Is this bad database design? If so, which normal form rule is broken? How can I query this table to identify which groups of columns are typically filled together with data? My goal would be to have, say 45 tables with 10 columns and fewer null values. How can I do this? How do I avoid breaking existing applications?
The tables have about 200,000 sample records so far. Users are asking me to add more columns for more tests, but I'd rather build a new table. Is this wise?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
我看过一些文章/论文,表明数据库中仅包含 NULL 就破坏了第一范式。
根据我从您对数据库的描述中收集到的信息,更好的设计可能如下:具有
始终与示例关联的字段的示例表。例如,
然后,测试类型表,其中针对可以执行的每种测试类型都有一个条目。
最后,有一个中间表,表示上述两个表之间的多对多关系并保存测试结果。
这将消除空值,因为 TestResult 表将仅包含对每个样本实际执行的测试的条目。我曾经设计过一个数据库,其目的与我认为您正在做的几乎相同,这就是我采取的方法。
I have seen articles / papers that indicate that simply having NULLs in the database breaks the first normal form.
From what I've gathered from your description of the database, a better design might be as follows:
A Sample table with fields that are always associated with a sample. For example,
Then, a table of test types with one entry for each type of test that can be performed.
Finally, have an intermediate table that represents the many-to-many relationship between the two above tables and holds the results for the tests.
This would eliminate the null values because the TestResult table would only contain entries for the tests that were actually performed on each sample. I once designed a database for an almost identical purpose to what I believe you are doing and this is the approach I took.
您可以使用众所周知的实体属性值模型 (EAV)。何时适合使用 EAV 的描述非常适合您的用例:
在您的具体情况下:
EAV 有一些严重的缺点并造成许多困难,因此仅应在适当的情况下应用它。如果您需要在一行中返回特定样本的所有测试结果,则不应使用它。
在不破坏现有应用程序的情况下,很难修改数据库以使用这种结构。
You could use the well known Entity Attribute Value model (EAV). The description of when it is appropriate to use EAV fits quite well with your use case:
In your specific case:
EAV has some serious drawbacks and creates a number of difficulties so it should only be applied when it is appropriate to do so. You should not use it if you need to return all test results for a specific sample in a single row.
It will be difficult to modify the database to use this structure without breaking existing applications.
我不确定这个设计真的那么糟糕。实际上,NULL 值的存储成本应该相对较低。在 SQL Server 中,每行都有一个内部位字段(或多个字段)来指示哪些列值为 NULL。
如果应用程序的性能不需要提高,并且由于更改表架构而进行重构的成本效益并不积极,那么为什么要更改它呢?
I'm not sure the design is really that bad. NULL values should actually be relatively cheap to store. In SQL Server, there is an internal bit field (or fields) for each row that indicates which column values are NULL.
If the performance of the application doesn't need to be improved and the cost-benefit of refactoring due to changing the table schema isn't positive, why change it?
仅仅因为没有破坏范式规则并不意味着它不是糟糕的数据库设计。一般来说,您最好采用较小的行更紧密地包装的设计,因为这样可以在一个页面中容纳更多的行,从而减少数据库要做的工作。根据当前的设计,数据库服务器必须投入大量空间来保存空值。
避免破坏现有应用程序是困难的部分,如果其他应用程序只需要读取访问,您可以编写一个看起来与旧表相同的视图。
Just because no normal-form rules are broken doesn't mean it isn't bad database design. Generally you're better off with a design with smaller rows more tightly packed, because that way more rows can fit in a page so there is less work for the database to do. With the current design the database server is having to devote a lot of space to holding null values.
Avoiding breaking existing applications is the tough part, if the other applications need only read-access you could write a view that looks identical to the old table.
如果您确实更改了表结构,我建议您使用一个名为 tblSampleData 的视图,它返回与表现在相同的数据。这将保留一些兼容性。
If you do change your table structure, I'd recommend having a view called tblSampleData which returns the same data as the table does now. That'll preserve some compatibility.
您可能甚至不需要 RDBMS 来存储这些数据。将数据存储在结构化二进制文件或 DBM/ISAM 表中。
它没有标准化。通常,缺乏标准化是所有问题的根源。但在这种情况下,缺乏规范化并不是世界末日,因为这些数据是“只读”的,只有一个键,并且与其他任何东西都不相关。所以更新异常不用担心。您只需担心原始数据是否一致。
如果您将 NULL 视为在整个应用程序中具有相同含义的“特殊值”,那么所有这些 NULL 并没有什么太严重的错误。未收集数据。数据不可用。对象拒绝回答问题。数据异常。数据待定。已知数据未知。对象说他们不知道……等等,你明白了。没有定义原因且没有定义含义而允许NULL是非常错误的。
我说的是标准化。要么定义特殊值并创建一张巨大的表。或者,为 VB 和 PHP 程序员保留 NULL,并正确分割数据。如果您需要支持遗留代码,请创建一个视图来加入数据备份。根据您的描述,您正在谈论几个小时的工作才能使这件事正确。这并不是一个糟糕的交易。
You probably don't even need a RDBMS for this data. Store your data in structured binary files or a DBM/ISAM table.
It's not normalized. Usually, lack of normalization is the source of all your problems. But in this case, lack of normalization is not the end of the world because this data is "read only", there's only one key, and it's not related to anything else. So update anomalies shouldn't be a worry. You only have to worry that the original data is consistent.
There's nothing too terribly wrong with all those NULLs if you treat NULLs as a "special value" with the same meaning across the entire app. Data was not collected. Data not available. Subject refused to answer question. Data is outlier. Data is pending. Data is known to be UNKNOWN. Subject said they didn't know... etc. you get the idea. Allowing NULLs for no defined reason with no defined meaning is terribly wrong.
I say normalize it. Either define special values and create one massive table. Or, leave NULLs for the VB and PHP programmers, and split it up your data properly. Create a VIEW to join the data back up if you need to support legacy code. From what you described, you're talking about a couple hours of work to get this thing correct. That's not such a bad deal.
假设您的测试机 X 有 40 个测量通道。如果您知道在每次测试中测试人员将仅使用几个通道,您可以将设计更改为:
tblTest: testId, testDate
tblResult:testId、machineId、channelId、Result
您始终可以使用交叉表检索先前的布局。
Let's say you have test machine X with 40 measurement channels. If you know that on each test the testers will use just a few channels, you could change the design to:
tblTest: testId, testDate
tblResult: testId, machineId, channelId, Result
You could always retrieve the pevious layout using a crosstab.
我会使用 1 个主表,其中每个样本有 1 行,它将包含每个样本应具有的所有列:
然后我会为每个不同的测试或类似测试的“类”添加一个表,并且包括与这些相关的所有列(使用实际测试名称而不是 XYZ):
搜索结果时,您将搜索适用的测试方法表并连接回实际示例表。
I d go with 1 main table, where you'd have 1 row per sample, it would contain all the columns that every sample should have:
I'd then add one table for each different test or "class" of similar tests, and include all columns related to those (use the actual test name and not XYZ):
When you search for a result, you would search the test method table that applies and join back to the actual sample table.
EAV 是一个选项,但查询会害死你。
是否可以选择将数据迁移到 MongoDB 等 NoSQL DB?我相信这将是解决您的问题的最有效和最简单的方法。既然您提到您基本上是在执行 CRUD 查询,NoSQL 应该非常高效。
EAV is an option but the queries will kill you.
Is it an option to migrate the data to a NoSQL DB like MongoDB? I believe this will be the most efficient and easy way arround your problem. Since you mentioned you're basically doing CRUD queries NoSQL should be pretty efficient.
目前的设计很糟糕。一般来说,具有大量 NULL 值的数据库表明设计不佳,违反了第四范式。但设计的最大问题并不是违反正常原则,而是添加新的测试类型需要更改数据库结构,而不是简单地添加一些数据 > 几个“定义”测试的表。更糟糕的是,它需要对现有表进行结构更改,而不是添加新表。
您可以通过采用其他人所描述的键值系统来实现完美的第四范式。但是,您可以通过执行以下任一操作来大幅改进数据库的设计,同时仍然保持理智(在没有 ORM 的情况下使用键值系统时很难做到这一点):
尝试发现最大的数字代表任何单独测试所需的测量值。如果测试返回不同的数据类型,您需要发现最大测试返回的每种数据类型的最大数量的值。创建一个仅包含这些列的表,标记为 Meas1、Meas2 等。您可能需要 10 或 40 列,而不是 400 列。然后创建一组表,描述每列对于每个测试的“含义”。此信息可用于根据所存储的测试类型提供有意义的提示和报告列标题。这不会完全消除 NULL,但会大大减少它们,并且只要任何新测试可以“适合”您指定的测量数量,新测试就可以作为数据而不是结构更改添加。
发现每个测试的实际测量列表,并创建一个单独的表来保存每个测试的结果(测试 ID、谁运行它、时间等基本信息仍然放入单个表中)。这是一个多表继承模式(不知道有没有真名)。您仍然需要为每个新测试创建一个新的“数据”表,但现在您不会触及其他现有的生产表,并且您将能够实现完美的范式。
我希望这提供了一些开始的想法。
The current design is a poor one. In general a database with lots of NULL values is an indication of poor design, violating 4th normal form. But the biggest problem with the design is not a violation of normal principles but the fact that the addition of a new test type requires changes to the database structure rather than simply adding some data to several tables that "define" a test. Even worse, it requires structural changes to an existing table, rather than addition of new tables.
You can achieve perfect fourth normal form by adapting a key-value system as described by others. But you may be able to substantially improve the design of the database and still maintain your sanity (something hard to do when working with key-value systems without an ORM) by doing either of the following:
Attempt to discover the largest number of measurements required to represent any individual test. If there are different data types returned by the tests, you'll need to discover the largest number of values of each data type returned by the largest test. Create a table with those columns only, labelled Meas1, Meas2, etc. Instead of 400 columns you'll need, perhaps, 10. Or 40. Then create a set of tables that describe what each column "means" for each test. This information can be used to provide meaningful prompts and report column headers depending on the type of test being stored. This will not eliminate NULLs entirely, but will greatly reduce them and, as long as any new test can be "fit" into the number of measurements you specified, new test can be added as data rather than structural changes.
Discover the actual list of measurements for each test and create a separate table to hold the results of each one (basic information like test ID, who ran it, the time, etc still go into a single table). This is a multi-table inheritance pattern (I don't know if it has a real name). You still have to create a new "data" table for each new test, but now you wont be touching other existing production tables and you will be able to achieve perfect normal form.
I hope this provides some ideas to get started.