Which is more important: normalization or ease of coding?
I have an Excel spreadsheet that I am going to turn into a DB so I can mine the data and build an interactive app. There are about 20 columns and 80,000 records. In practically every record about half of the columns are null, but which columns hold data varies randomly from record to record.
The options would be to:
Create a more normalized DB with one table per column, and use 20 joins to view all of the data. I would expect the benefit to be a DB with essentially no NULL values, and therefore a smaller size. One major con would be the extra application code needed to update each table.
Create a flat design: a single table that holds all of the columns. I figure this would make updates easier on the application side, but it would leave the table with a lot of empty space.
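To make the two options concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The three columns `col_a`, `col_b`, `col_c` and the sample rows are made up for illustration and stand in for the 20 real spreadsheet columns. It suggests that the per-column layout has to be read back with `LEFT JOIN`s, which hand the NULLs straight back to the application.

```python
import sqlite3

# Hypothetical stand-ins for the 20 spreadsheet columns and their records.
COLUMNS = ("col_a", "col_b", "col_c")
ROWS = [(1, "x", None, "z"), (2, None, "y", None)]


def load_flat(conn):
    # Option 2: one wide table; missing values are stored as NULL.
    conn.execute(
        "CREATE TABLE flat (id INTEGER PRIMARY KEY, col_a TEXT, col_b TEXT, col_c TEXT)"
    )
    conn.executemany("INSERT INTO flat VALUES (?, ?, ?, ?)", ROWS)


def load_split(conn):
    # Option 1: a table of record ids plus one table per column.
    # A row is written to a column table only when a value exists.
    conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY)")
    for col in COLUMNS:
        conn.execute(
            f"CREATE TABLE {col} ("
            "record_id INTEGER PRIMARY KEY REFERENCES record(id), "
            "value TEXT NOT NULL)"
        )
    for rec_id, *values in ROWS:
        conn.execute("INSERT INTO record VALUES (?)", (rec_id,))
        for col, value in zip(COLUMNS, values):
            if value is not None:
                conn.execute(f"INSERT INTO {col} VALUES (?, ?)", (rec_id, value))


def read_flat(conn):
    return conn.execute(
        "SELECT id, col_a, col_b, col_c FROM flat ORDER BY id"
    ).fetchall()


def read_split(conn):
    # One LEFT JOIN per column table; absent values come back as NULL (None).
    joins = " ".join(f"LEFT JOIN {c} ON {c}.record_id = record.id" for c in COLUMNS)
    cols = ", ".join(f"{c}.value" for c in COLUMNS)
    return conn.execute(
        f"SELECT record.id, {cols} FROM record {joins} ORDER BY record.id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_flat(conn)
load_split(conn)
```

Both `read_flat(conn)` and `read_split(conn)` return the same rows here, so the application code still has to handle missing values in either layout; the difference is in how many statements a read or an update takes.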
Comments (5)
I don't get why you think updating a normalized DB is harder than updating a flat table. It's very much the other way around.
Think about inserting a relation between a customer and a product (basically an order). You'd have to:
What about the first time? What do you do with the initial nulls? Do you modify your selects to ignore them? What if you want the nulls?
What if you delete the last product? Do you change it into an update and set nulls for just a few columns?
Joins aside, working with normalized tables is trivial by design. You pay for that triviality with performance; that's the actual trade-off.
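As a rough illustration of the normalized side of that customer/product example, here is a sketch using Python's `sqlite3` module. The table names (`customer`, `product`, `customer_order`), the helper functions, and the sample data are all hypothetical. Under this layout, adding or removing an order is a single-row `INSERT` or `DELETE`, with no NULL placeholders to manage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Junction table: one row per customer/product relation.
    CREATE TABLE customer_order (
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        product_id  INTEGER NOT NULL REFERENCES product(id),
        PRIMARY KEY (customer_id, product_id)
    );
    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO product VALUES (10, 'Widget');
    INSERT INTO product VALUES (11, 'Gadget');
    """
)


def add_order(conn, customer_id, product_id):
    # The first order and the tenth order look exactly the same.
    conn.execute(
        "INSERT INTO customer_order VALUES (?, ?)", (customer_id, product_id)
    )


def remove_order(conn, customer_id, product_id):
    # Removing the last order is still a plain DELETE; nothing is set to NULL.
    conn.execute(
        "DELETE FROM customer_order WHERE customer_id = ? AND product_id = ?",
        (customer_id, product_id),
    )


def products_for(conn, customer_id):
    rows = conn.execute(
        """
        SELECT product.name
        FROM customer_order
        JOIN product ON product.id = customer_order.product_id
        WHERE customer_order.customer_id = ?
        ORDER BY product.name
        """,
        (customer_id,),
    )
    return [name for (name,) in rows]
```

A customer with no orders simply has no rows in `customer_order`, so `products_for` returns an empty list rather than a row full of NULLs.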
If you are going to use a relational database, you should normalize your tables, if for no other reason than to ease data maintenance and ensure you don't have duplicate data.
You can investigate the use of a document database for storage instead of a relational database, though it is not the only option.
Generally, normalized databases end up being easier to write code against, because SQL is designed with normalized tables in mind.
Normalizing doesn't have to be done on all columns, so there's a middle ground between the two options you present. A good rule of thumb is that if you have columns that have values being repeated heavily across records, those can be good candidates for normalizing into one or more separate tables. Putting each column in its own table and joining across them is almost certainly overdoing it.
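One way to picture that middle ground is the sketch below, again using Python's `sqlite3` module. The `region` column, the `note` column, and the sample rows are invented for illustration. It assumes `region` is the heavily repeated column, moves only that column into a lookup table, and leaves everything else in a single `record` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical import of the spreadsheet: 'region' repeats a lot, 'note' does not.
    CREATE TABLE flat (id INTEGER PRIMARY KEY, region TEXT, note TEXT);
    INSERT INTO flat VALUES (1, 'North', 'a');
    INSERT INTO flat VALUES (2, 'North', NULL);
    INSERT INTO flat VALUES (3, 'South', 'c');
    INSERT INTO flat VALUES (4, NULL,    'd');
    """
)


def extract_region_lookup(conn):
    # Move only the repeated column into its own table; keep the rest flat.
    conn.executescript(
        """
        CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        INSERT INTO region (name)
            SELECT DISTINCT region FROM flat
            WHERE region IS NOT NULL
            ORDER BY region;

        CREATE TABLE record (
            id        INTEGER PRIMARY KEY,
            region_id INTEGER REFERENCES region(id),
            note      TEXT
        );
        INSERT INTO record
            SELECT flat.id, region.id, flat.note
            FROM flat LEFT JOIN region ON region.name = flat.region;
        """
    )


def read_records(conn):
    # A single join restores the original view of the data.
    return conn.execute(
        """
        SELECT record.id, region.name, record.note
        FROM record LEFT JOIN region ON region.id = record.region_id
        ORDER BY record.id
        """
    ).fetchall()


extract_region_lookup(conn)
```

Reading the data back takes one join instead of twenty, and each distinct region name is stored once.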
Don't normalize too much. It's hard to maintain a canonical model as your application grows. Storage is cheap. Don't get fooled into coding headaches because of concerns that were valid 20 years ago. There's no need to go NoSQL unless you actually need it.