动态规范化表是否实用?
假设我的数据库跟踪鸟类目击事件(注:我实际上是在刮木桶底部的例子)。
这些字段是:
sighting_id | common_name | park_name | location | time | etc....
尽管我假设公园始终位于同一位置,但该网站就像一个电子表格。用户为每个条目输入park_name
和location
。另请注意,我的实际模式还有其他字段也依赖于类似的“公园名称”(例如州)。
我没有办法让用户预定义公园,所以我无法提前知道它们。我是否应该尝试动态规范化这些数据?例如,我的程序是否应该自动填充 parks
表,用 park_id
替换观鸟表中的 park_name 和 location 列?
我主要担心的是表现。列出每个目击事件需要加入填充公园和位置。此外,动态管理几乎肯定需要比节省的资源更多的资源。我可能需要一个 Cron 工作来消除孤立的 Parks,因为它们可能会在多次目击中被引用。
Let's say my database tracks bird sightings (Note: I'm really scraping the bottom of the barrel for examples).
The fields are:
sighting_id | common_name | park_name | location | time | etc....
Although I'm assuming that a park will always be in the same location, the website is like a spreadsheet. The user enters park_name
and location
for every entry. Also please note that my actual schema has other fields that are dependent on the analogous "park name" as well (e.g. state).
I do not have a way for the user to predefine parks, so I can't know them ahead of time. Should I even attempt to dynamically normalize this data? For example, should my program automatically populate a parks
table, replacing the park_name and location column in the bird sighting table with a park_id
?
I'm worried about performance, mostly. Listing every sighting would require a join to populate park and location. Also, dynamically managing this would almost certainty require more resources than it would save. I would probably need a Cron job to eliminate orphaned Parks, since they may be referenced in multiple sightings.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
这有点取决于您的使用情况。标准化方法(公园是一张桌子)将使以下查询变得更容易:
但是是的,你确实遇到了一些棘手的问题。 “如果公园 XYZ 不存在,则将其插入公园表”的模式会遇到您必须处理的竞争条件。
现在,我们来谈谈反对规范化的一些论点……大多数客户数据库可能将我的街道地址存储为“123 Foo Street”,而不动态规范化街道名称(我们可以有一个街道表并将“Foo Street”放在那里,然后引用为什么我要提出这一点,是为了表明即使是讨厌任何重复数据的人也可能会承认有一条线你不一定必须跨越
另一个愚蠢的例子是我们可能会分享 。我们真的需要一个包含唯一姓氏和姓氏的表吗? 从其他表中获取外键可能对某些应用程序很有帮助,但对于 99% 的应用程序来说,这太过分了,而且性能较差,几乎没有任何好处。
那么 我希望如何从表中查询数据。老实说,在这种情况下,我可能会为公园做一个单独的表,但在其他情况下,我选择不这样做
。 。
It depends on a bit on your usage. The normalized approach (park is a table) will make the following queries easier:
But yes, you do run into some sticky issues. The pattern "if park XYZ doesn't exist then insert it into the parks table" suffers from a race condition that you'll have to deal with.
Now, how about some arguments against normalization here... Most customer databases probably store my street address as "123 Foo Street", without dynamically normalizing the street name (we could have a street table and put "Foo Street" there, then reference it from other tables. Why do I bring this up, well to show that even the guys who hate any repeated data will probably acknowledge that there is some line you don't necessarily have to cross.
Another silly example would be that we might share last names. Do we really need a table for unique last names and then foreign key to it from other tables? There might be some applications where this is helpful but for 99% of application out there, this goes too far. It's just more work and less performant for little to no gain.
So I'd consider how I want to be able to query data back out of the table. Honestly in this case I'd probably do a separate table for parks. But in other cases I've chosen not to.
That's my two cents, one cent after taxes.
我对原始“公园”示例的两点看法(相对于OP的实际问题):
反对尝试自动标准化公园和位置列的决定性论点是可用性:当数据以可编辑的类似电子表格的格式呈现给用户时,他们自然会假设每一行都可以独立编辑,因此如果某些列例如因为“位置”是实际上与公园有关,而不是与行有关。
处理此类情况的典型模式是仅提示用户输入公园的详细信息,并在输入新公园时在“公园”表中创建一行。例如,如果公园列包含一个下拉框,则最后一个选项可以是“添加新公园”。或者,当用户输入无法识别的公园名称时添加一个新公园,但仍然让用户清楚正在创建一个新公园。
My two cents on the original "parks" example (as opposed to the OP's actual problem):
The decisive argument against trying to automatically normalize the park and location columns is usability: when data is presented to the user in an editable spreadsheet-like format, they will naturally assume that each row can be independently edited, so it's deceptive (and likely to lead ultimately to confusion) if some columns such as "location" are actually associated with the park, rather than the row.
A typical pattern for handling this sort of situation is to only prompt the user for park's details and create a row in the "parks" table when a new park is entered. For example, if the park column contains a drop-down box, then the last option could be "add new park". Alternatively, add a new park when the user enters an unrecognized park name -- but still make it clear to the user that a new park is being created.