Implementing flexible relationships in an RDBMS: what are the real tradeoffs?

Posted 2024-11-17 05:22:55 · 1224 words · 6 views · 0 comments

I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:

1) Create tables

Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)

This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
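To make approach 1 concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column types, the composite primary key, the sample rows, and the helper `attributes_of` are assumptions made for illustration; the question only gives the three table outlines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE attributes (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE product_attributes (
        product_id   INTEGER NOT NULL REFERENCES products(id),
        attribute_id INTEGER NOT NULL REFERENCES attributes(id),
        value        TEXT,              -- every value is stored as a string
        PRIMARY KEY (product_id, attribute_id)
    );
    -- Hypothetical sample data
    INSERT INTO products   VALUES (1, 'Product A'), (2, 'Product B');
    INSERT INTO attributes VALUES (1, 'size'), (2, 'color'), (3, 'calories');
    INSERT INTO product_attributes VALUES
        (1, 1, 'medium'), (1, 2, 'blue'), (2, 3, '250');
""")


def attributes_of(product_id):
    """Return one product's attributes as a {name: value} dict."""
    rows = conn.execute(
        """SELECT a.name, pa.value
             FROM product_attributes pa
             JOIN attributes a ON a.id = pa.attribute_id
            WHERE pa.product_id = ?""",
        (product_id,),
    )
    return dict(rows.fetchall())


product_a = attributes_of(1)
product_b = attributes_of(2)
```

Note that calories comes back as the string '250', not a number: the single value column cannot carry a type per attribute.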

Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you never create real tables beyond a few basic initial ones (e.g. object, attribute, object_attribute), which I think we can all agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group these two architectures together.

What issues have you encountered (from experience or in theory) that make this design so bad?

2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?

3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.

What I like about this approach, though, is that you can query one product and easily access all of its properties in your code, which is fast. And if you delete a product, you won't have to clean up other tables, so it is easy to stay consistent.
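A rough sketch of approach 3, again with Python's sqlite3 and json modules. The `attrs` column name, the sample rows, and both helper functions are hypothetical. It shows the tradeoff described above: loading one product is a single-row read, while searching by attribute means parsing every blob in application code. Some databases offer JSON query functions that soften this, but they are not used here.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, attrs TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Product A", json.dumps({"size": "medium", "color": "blue"})),
        (2, "Product B", json.dumps({"calories": 250, "sugar": 12})),
    ],
)


def load_product(product_id):
    """One row read gives the product and all of its attributes."""
    name, blob = conn.execute(
        "SELECT name, attrs FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    return {"name": name, **json.loads(blob)}


def products_where(attr, value):
    """Searching by attribute scans and parses every blob."""
    matches = []
    for pid, blob in conn.execute("SELECT id, attrs FROM products ORDER BY id"):
        if json.loads(blob).get(attr) == value:
            matches.append(pid)
    return matches


product_a = load_product(1)
blue_ids = products_where("color", "blue")
```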

4) I have read some things about being able to index strongly typed xml formats in some RDBMSs, but I honestly don't know much about this approach.

I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!

别低头，皇冠会掉 2024-11-24 05:22:55

You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".

One of the issues with this approach is that you end up mixing meta-data with actual data. Your "attribute" has to now tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.

Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight" what's to stop someone from putting "22 inches" in the value? Or a non-numeric value completely. You might say, "Well, my application will handle that." Then you need to change your application every time that you want to add a new attribute because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.
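The integrity point can be illustrated with a small sqlite3 sketch. The table layouts and the `weight_kg` column are assumptions for the example. SQLite is loosely typed, so the CHECK uses SQLite's `typeof()` to stand in for the type enforcement that most other databases apply to a numeric column automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- EAV style: the value column is a free-form string
    CREATE TABLE product_attributes (
        product_id INTEGER NOT NULL,
        attribute  TEXT    NOT NULL,
        value      TEXT
    );
    -- Conventional column: the database constrains type and range
    CREATE TABLE products (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        weight_kg REAL NOT NULL
                  CHECK (typeof(weight_kg) = 'real' AND weight_kg > 0)
    );
""")

# The EAV table happily stores a length where a weight belongs.
conn.execute("INSERT INTO product_attributes VALUES (1, 'weight', '22 inches')")
eav_rows = conn.execute("SELECT COUNT(*) FROM product_attributes").fetchone()[0]

# The typed column accepts a real weight and rejects the same bad value.
conn.execute("INSERT INTO products VALUES (1, 'Widget', 1.5)")
try:
    conn.execute("INSERT INTO products VALUES (2, 'Gadget', '22 inches')")
    typed_column_rejected = False
except sqlite3.IntegrityError:
    typed_column_rejected = True

product_rows = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```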

Third, how do you enforce that a given product has all of the attributes that it needs? In a row you can make a column NOT NULL, and a value is then required before that row can get into the database. You can't enforce that in the EAV model.
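Since the EAV schema cannot declare an attribute mandatory, the closest substitute is an audit query run after the fact. The sketch below assumes a simplified two-table layout, a hypothetical `REQUIRED` list, and made-up sample rows; it finds products that are missing a required attribute rather than preventing them from being saved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product_attributes (
        product_id INTEGER NOT NULL REFERENCES products(id),
        attribute  TEXT    NOT NULL,
        value      TEXT,
        PRIMARY KEY (product_id, attribute)
    );
    INSERT INTO products VALUES (1, 'Shirt'), (2, 'Mug');
    INSERT INTO product_attributes VALUES
        (1, 'size', 'medium'), (1, 'color', 'blue'),
        (2, 'color', 'white');          -- 'size' was never entered for the mug
""")

REQUIRED = ("size", "color")  # hypothetical list of mandatory attributes


def products_missing_required():
    """Names of products lacking at least one required attribute value."""
    placeholders = ", ".join("?" for _ in REQUIRED)
    sql = f"""
        SELECT p.name
          FROM products p
          LEFT JOIN product_attributes pa
                 ON pa.product_id = p.id
                AND pa.attribute IN ({placeholders})
                AND pa.value IS NOT NULL
         GROUP BY p.id, p.name
        HAVING COUNT(pa.attribute) < ?
         ORDER BY p.name
    """
    return [row[0] for row in conn.execute(sql, (*REQUIRED, len(REQUIRED)))]


incomplete = products_missing_required()
```

With conventional columns the same rule is just `size TEXT NOT NULL`, and the incomplete row never gets in.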

Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.

Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.

Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.

These are just a few of the problems which I've encountered over the years. I'm sure that there are others.

What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.
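As a rough illustration of subtype tables, the sketch below splits the question's two example products into a shared `products` table plus one table per category. The category names, column names, and constraints are assumptions chosen for the example, not something the answer specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    -- Columns shared by every product
    CREATE TABLE products (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        kind TEXT NOT NULL CHECK (kind IN ('apparel', 'food'))
    );
    -- One subtype table per category, with typed, mandatory columns
    CREATE TABLE apparel_products (
        product_id INTEGER PRIMARY KEY REFERENCES products(id) ON DELETE CASCADE,
        size       TEXT NOT NULL,
        color      TEXT NOT NULL
    );
    CREATE TABLE food_products (
        product_id INTEGER PRIMARY KEY REFERENCES products(id) ON DELETE CASCADE,
        calories   INTEGER NOT NULL CHECK (calories >= 0),
        sugar_g    REAL    NOT NULL CHECK (sugar_g >= 0)
    );
    INSERT INTO products VALUES (1, 'Product A', 'apparel'), (2, 'Product B', 'food');
    INSERT INTO apparel_products VALUES (1, 'medium', 'blue');
    INSERT INTO food_products    VALUES (2, 250, 12.0);
""")

# One join per category returns a complete, typed row.
food = conn.execute("""
    SELECT p.name, f.calories, f.sugar_g
      FROM products p
      JOIN food_products f ON f.product_id = p.id
""").fetchall()

# Deleting a product removes its subtype row through the cascade.
conn.execute("DELETE FROM products WHERE id = 1")
apparel_left = conn.execute("SELECT COUNT(*) FROM apparel_products").fetchone()[0]
```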

吝吻 2024-11-24 05:22:55

There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.

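-- Pick the 'size' attribute first, then look up each product's value for it.
-- (Joining product_attribute before filtering on the name would return
-- one row per attribute, with every value labelled as the size.)
SELECT p.name   AS product_name,
       pa.value AS product_size
  FROM product p
         LEFT OUTER JOIN attribute a
                      ON a.name = 'size'
         LEFT OUTER JOIN product_attribute pa
                      ON pa.product_id   = p.product_id
                     AND pa.attribute_id = a.attribute_id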

If you want to get the size and some other attribute like color, things get trickier.

-- Each requested attribute needs its own pair of joins and its own aliases.
SELECT p.name         AS product_name,
       pa_size.value  AS product_size,
       pa_color.value AS product_color
  FROM product p
         LEFT OUTER JOIN attribute a_size
                      ON a_size.name = 'size'
         LEFT OUTER JOIN product_attribute pa_size
                      ON pa_size.product_id   = p.product_id
                     AND pa_size.attribute_id = a_size.attribute_id
         LEFT OUTER JOIN attribute a_color
                      ON a_color.name = 'color'
         LEFT OUTER JOIN product_attribute pa_color
                      ON pa_color.product_id   = p.product_id
                     AND pa_color.attribute_id = a_color.attribute_id

Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.
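To show how the "blue and medium" search grows, here is a small runnable comparison using sqlite3. The sample products and the `product_flat` comparison table are hypothetical; the EAV table and column names follow the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product   (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE product_attribute (
        product_id INTEGER NOT NULL, attribute_id INTEGER NOT NULL, value TEXT,
        PRIMARY KEY (product_id, attribute_id)
    );
    -- The same facts in a conventional table, for comparison
    CREATE TABLE product_flat (
        product_id INTEGER PRIMARY KEY, name TEXT, size TEXT, color TEXT
    );

    INSERT INTO product   VALUES (1, 'Shirt'), (2, 'Hat'), (3, 'Scarf');
    INSERT INTO attribute VALUES (1, 'size'), (2, 'color');
    INSERT INTO product_attribute VALUES
        (1, 1, 'medium'), (1, 2, 'blue'),
        (2, 1, 'medium'), (2, 2, 'red'),
        (3, 2, 'blue');
    INSERT INTO product_flat VALUES
        (1, 'Shirt', 'medium', 'blue'),
        (2, 'Hat',   'medium', 'red'),
        (3, 'Scarf', NULL,     'blue');
""")

# EAV: two joins per attribute in the search condition.
EAV_SEARCH = """
    SELECT p.name
      FROM product p
      JOIN product_attribute pa_size  ON pa_size.product_id   = p.product_id
      JOIN attribute         a_size   ON a_size.attribute_id  = pa_size.attribute_id
                                     AND a_size.name          = 'size'
      JOIN product_attribute pa_color ON pa_color.product_id  = p.product_id
      JOIN attribute         a_color  ON a_color.attribute_id = pa_color.attribute_id
                                     AND a_color.name         = 'color'
     WHERE pa_size.value = 'medium'
       AND pa_color.value = 'blue'
"""

# Conventional columns: the same search is a single-table predicate.
FLAT_SEARCH = "SELECT name FROM product_flat WHERE size = 'medium' AND color = 'blue'"

eav_result = [row[0] for row in conn.execute(EAV_SEARCH)]
flat_result = [row[0] for row in conn.execute(FLAT_SEARCH)]
```

Both queries return the same product, but every extra search condition adds two more joins to the EAV version and one more predicate to the flat one.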

This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"

Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.

No query tool is going to handle this sort of data model well. So all the users that can currently fire up their favorite query tool and run some reports out of your database are now stuck waiting for developers to write their reports and do their extracts for them.

Data quality becomes very hard to enforce. It becomes very hard to check conditions that involve multiple attributes (e.g. if a product's size is Medium then the weight must be between 1 and 10 pounds; if a product's height is specified then a width is required as well), so people don't make those checks. They don't write the reports to identify where these sorts of rules are violated. So the data ends up being a bit bucket that downstream processes decide they can't use because it isn't sufficiently complete.

You're moving too much of the initial requirements discussion off into the future, when understanding the core entities will likely lead to a much better design overall. If you can't agree on a set of attributes that the first version of the product needs to support, you don't really understand what that version is supposed to do. Even if you successfully code a very generic application, it is going to require a lot of time to configure once you've built it (because someone will have to figure out what attributes it supports at that point). And then you'll discover when the application is being configured that you missed a ton of requirements that only became clear when the attributes were defined. You can't know that width is required if height is specified if you don't know whether they're going to store height or width in the first place.

In the worst case, the response to this problem during configuration is to immediately determine that you need to provide a flexible way to specify business rules and workflows, so that the people configuring the application can quickly code their business rules when they add new attributes and can control the flow of the application by grouping attributes together or skipping certain pages (e.g. have a page where make and model are required if the product type is car, and skip that page if not). But in order to do that, you're going to end up building an entire development environment. And you're going to push the job of actually coding the application to the folks that are configuring the product. Unless you happen to be really good at building development environments, and unless the people configuring the product are really developers, this doesn't end well.
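For contrast with the data-quality point above, the two example rules (Medium implies a weight of 1 to 10 pounds; a height requires a width) are ordinary table-level CHECK constraints when the attributes are real columns. This is a minimal sqlite3 sketch; the column names, units, sample rows, and the `try_insert` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        size      TEXT,
        weight_lb REAL,
        height_in REAL,
        width_in  REAL,
        -- Medium products must have a weight between 1 and 10 pounds
        CHECK (size IS NULL OR size <> 'Medium'
               OR (weight_lb IS NOT NULL AND weight_lb BETWEEN 1 AND 10)),
        -- If a height is given, a width is required as well
        CHECK (height_in IS NULL OR width_in IS NOT NULL)
    )
""")


def try_insert(row):
    """Return True if the database accepts the row, False if a CHECK rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO products (name, size, weight_lb, height_in, width_in) "
                "VALUES (?, ?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok_row = try_insert(("Crate", "Medium", 5.0, 12.0, 8.0))
too_heavy = try_insert(("Anvil", "Medium", 40.0, None, None))
height_without_width = try_insert(("Pole", "Large", 3.0, 72.0, None))
```

In the EAV layout the same rules span several rows of `product_attribute`, so a row-level CHECK cannot express them.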

风吹短裙飘 2024-11-24 05:22:55

"I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design."

No, we wouldn't. Here's why.

You started with this.

Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)

Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)

Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)

And translating that to teams and players . . .

Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)

So for sample data we might have

Team                   Player             Value
--
St. Louis Cardinals    Boggs, Mitchell    ?
St. Louis Cardinals    Carpenter, Chris   ?
St. Louis Cardinals    Franklin, Ryan     ?
St. Louis Cardinals    Garcia, Jaime      ?

What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.

Team                   Player             Value
--
St. Louis Cardinals    Boggs, Mitchell    23
St. Louis Cardinals    Carpenter, Chris   15
St. Louis Cardinals    Franklin, Ryan     19
St. Louis Cardinals    Garcia, Jaime      14

Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.
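By contrast, a conventional Team_Players table names each fact it stores. The sketch below is one possible layout, not the answer's own design: the column names and constraints are assumptions, the games-played figures are the sample values from above, and batting average is left NULL because the thread gives no values for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team_players (
        team_name       TEXT    NOT NULL,
        player_name     TEXT    NOT NULL,
        games_played    INTEGER NOT NULL CHECK (games_played >= 0),
        batting_average REAL    CHECK (batting_average BETWEEN 0 AND 1),  -- NULL = not recorded
        PRIMARY KEY (team_name, player_name)
    );
    INSERT INTO team_players (team_name, player_name, games_played) VALUES
        ('St. Louis Cardinals', 'Boggs, Mitchell',  23),
        ('St. Louis Cardinals', 'Carpenter, Chris', 15),
        ('St. Louis Cardinals', 'Franklin, Ryan',   19),
        ('St. Louis Cardinals', 'Garcia, Jaime',    14);
""")

# The schema itself says what each number means.
columns = [row[1] for row in conn.execute("PRAGMA table_info(team_players)")]

most_games = conn.execute("""
    SELECT player_name, games_played
      FROM team_players
     ORDER BY games_played DESC
     LIMIT 1
""").fetchone()
```

Here 23 can only mean games played, and a second statistic gets its own column instead of competing for the single value slot.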
