如何在列上存储元数据
假设您正在收集有关即将上映的超级英雄电影的内部信息,并且您的主电影表如下所示:
表 1
Title Director Leading Male Leading Female Villain
--------------------------------------------------------------------------
Green Lantern Kubrick Robert Redford Miley Cyrus Hugh Grant
The Tick Mel Gibson Kevin Sorbo Linda Hunt Anthony Hopkins
一般而言,该表应该工作得很好,并且允许非常简单的查询以及行之间的比较。
但是,您希望跟踪每个数据事实的来源,以及发现该事实的记者的姓名。 这似乎暗示了某种 EAV 表,如下所示:
表2
Movie Attribute Value Source Journalist
----------------------------------------------------------------------------------
Green Lantern Director Kubrick CHUD Sarah
Green Lantern Leading Male Robert Redford CHUD James
Green Lantern Leading Female Miley Cyrus Dark Horizons James
Green Lantern Villain Hugh Grant CHUD Sarah
The Tick Director Mel Gibson Yahoo Cameron
...
虽然它可以轻松捕获我们想要的元数据,但会使查询变得更加困难。 简单地获取一部电影的所有基本数据需要更多的时间。 更具体地说,您必须在这里处理四行才能获取有关 Green Lantern 的四个重要信息,而在表 1 中它是一个单独的、封装良好的行。
所以我的问题是,鉴于我刚才描述的复杂性,并且因为我知道通常要避免 EAV 表,EAV 仍然是最佳解决方案吗? 这似乎是表示这些数据的唯一合理的方式。 我看到的唯一其他替代方案是将表 1 与另一个仅包含如下元数据的表结合使用:
表 3
Movie Attribute Source Journalist
----------------------------------------------------------------------------------
Green Lantern Director CHUD Sarah
Green Lantern Leading Male CHUD James
Green Lantern Leading Female Dark Horizons James
Green Lantern Villain CHUD Sarah
The Tick Director Yahoo Cameron
...
但这非常危险,因为如果有人更改表 1 中的列名称,例如“Villain”到“Primary Villain”,表 3 中的行仍然会简单地说“Villain”,因此相关数据将不幸地被解耦。 如果“属性”列链接到另一个表,该表充当表 1 的列的枚举,这可能会有所帮助。当然,DBA 将负责维护此枚举表以匹配表 1 的实际列。实际上,可以通过使用 SQL Server 中包含表 1 中列名的系统视图来代替手动创建枚举表来进一步改进这一点。尽管我不确定您是否可以拥有涉及以下内容的关系:系统视图。
你有什么建议? EAV 是唯一的出路吗?
如果只有一个元数据列(只是“来源”,没有“记者”)怎么办?是否还有必要走 EAV 路线? 您可以有“Director”、“Director_Source”、“Leading Male”、“Leading Male_Source”等列,但这很快就会变得丑陋。 还有我没有想到的更好的解决方案吗?
如果我没有澄清任何一点,请发表评论,我会根据需要添加更多内容。 哦,是的,我使用的电影数据是捏造的:)
编辑:为了简洁地重申我的主要问题,我希望表 1 具有简单性和真正的 RDBMS 设计,它真正很好地描述了电影条目,同时仍然存储以安全且可访问的方式获取有关属性的元数据。 这可能吗? 或者EAV是唯一的方法吗?
编辑 2:在做了更多的网络研究之后,我还没有找到关于 EAV 的讨论,该讨论集中在将元数据存储在列上的愿望上。 实现 EAV 的主要原因几乎总是动态且不可预测的列,但我的示例中并非如此。 在我的例子中,总是有相同的四栏:导演、男主角、女主角、反派。 但是,我想存储有关每行每列的某些事实(来源和记者)。 EAV 可以促进这一点,但我想避免诉诸于此。
更新
使用表 2 的设计,除了将“Movie”列重命名为“Name”并将整个表命名为“Movie”之外,下面是 SQL Server 2008 中返回表 1 的透视操作:
SELECT Name, [Director], [Leading Male], [Leading Female], [Villain]
FROM (Select Name, Attribute, Value FROM Movie) as src
PIVOT
(
Max(Value)
FOR Attribute IN ([Director], [Leading Male], [Leading Female], [Villain])
) AS PivotTable
Let's say you're collecting insider info on upcoming superhero movie releases and your main Movie table looks something like this:
Table 1
Title Director Leading Male Leading Female Villain
--------------------------------------------------------------------------
Green Lantern Kubrick Robert Redford Miley Cyrus Hugh Grant
The Tick Mel Gibson Kevin Sorbo Linda Hunt Anthony Hopkins
This should work very well in general and allow very easy queries as well as comparisons between rows.
However, you'd like to track the source of each data fact, as well as the name of the journalist who discovered the fact. This seems to suggest some sort of an EAV table like this:
Table 2
Movie Attribute Value Source Journalist
----------------------------------------------------------------------------------
Green Lantern Director Kubrick CHUD Sarah
Green Lantern Leading Male Robert Redford CHUD James
Green Lantern Leading Female Miley Cyrus Dark Horizons James
Green Lantern Villain Hugh Grant CHUD Sarah
The Tick Director Mel Gibson Yahoo Cameron
...
Which, while it easily captures the meta-data that we wanted, makes queries harder. It takes a bit more to simply get all the basic data of a single movie. More specifically, you have to deal with four rows here to get the four important tidbits of information on the Green Lantern while in table 1 it is a single, nicely encapsulated row.
So my question is, in light of the complications I just described, and because I know in general EAV tables are to be avoided, is the EAV still the best solution? It does seems like it is the only reasonable way to represent this data. The only other alternative I see is to use table 1 in conjunction with another one that only houses meta data like this:
Table 3
Movie Attribute Source Journalist
----------------------------------------------------------------------------------
Green Lantern Director CHUD Sarah
Green Lantern Leading Male CHUD James
Green Lantern Leading Female Dark Horizons James
Green Lantern Villain CHUD Sarah
The Tick Director Yahoo Cameron
...
But this is very dangerous because if someone changes a column name in table 1, like "Villain" to "Primary Villain," the row in table 3 will still simply say "Villain" and thus the related data will be unfortunately decoupled. This could be helped if the "attribute" column was linked to another table that served as an enumeration of the columns of table 1. Of course, the DBA would be responsible for maintaining this enumeration table to match the actual columns of table 1. And it might actually be possible to improve this even further by instead of creating the enumeration table by hand, use a system view in SQL Server that houses the names of the columns in table 1. Though I'm not sure you can have relationships that involve system views.
What do you suggest? Is the EAV the only way to go?
And what if it was only one meta-data column (just "Source" without "Journalist") - is it still necessary to go the EAV route? You could have columns "Director," "Director_Source," "Leading Male," "Leading Male_Source," etc., but that gets ugly very quickly. Is there some better solution I'm not thinking of?
If I haven't clarified any point please comment and I'll add more as necessary. Oh yeah, and the movie data I used is fabricated :)
Edit: To restate my primary question concisely, I would like to have the simplicity and the true RDBMS design of table 1, which really describes a movie entry well, while still storing the meta data on the attributes in a safe and accessible manner. Is this possible? Or is EAV the only way?
Edit 2: After doing some more web research, I have yet to find a discussion on EAV's that centered around the desire to store metadata on the columns. The primary reason given to implement an EAV is almost always dynamic and unpredictable columns, which is not the case in my example. In my example, There are always the same four columns: director, leading male, leading female, villain. However, I want to store certain facts (source and journalist) about each column for each row. An EAV would facilitate this, but I would like to avoid resorting to that.
Update
Using the Table 2 design except for renaming the column "Movie" to "Name" and calling the whole table "Movie," here is the pivot operation in SQL Server 2008 to get back Table 1:
SELECT Name, [Director], [Leading Male], [Leading Female], [Villain]
FROM (Select Name, Attribute, Value FROM Movie) as src
PIVOT
(
Max(Value)
FOR Attribute IN ([Director], [Leading Male], [Leading Female], [Villain])
) AS PivotTable
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
您可以更改设计中您认为的事实值 ...数据模型中的事实似乎可以表示为以下 N 元组:
以下表结构应支持该数据模型你想要的,并且可以相对容易地被索引和加入。 您还可以创建一个仅导出事实值和事实类型的视图,以便您可以创建以下视角:
有趣的是,您可以认为这是将 EAV 模型完全应用于数据并分解个体的逻辑扩展电影(具有导演、主角、反派等的直观属性)转变为一个旋转结构,其中属性集中于信息源。
所提出的数据模型的优点是:
该数据模型的一些缺点是:
我已将电影数据提升了一个级别以进行标准化结构,并且您可以将电影名称下推到 MovieFact 结构中以保持一致性(因为对于某些电影,我可以想象即使名称是您可能想要跟踪源信息的内容)。
您的虚构电影数据将如下所示:
Your can change what you consider a fact value in your design ... it seems that a fact in your data model could be expressed as the following N-tuple:
The following table structures should support the data model you want, and can relatively easily be indexed and joined. You can also create a view that pivots out just the fact value and fact type so that you can create the following perspective:
Interestingly, you could consider this to be the logical extension of fully applying an EAV model to the data, and decomposing an individual movie (with it's intuitive attribution of director, lead, villain, etc) into a pivoted structure where attributes focus on the source of the information instead.
The benefits of the proposed data model are:
Some of the disadvantages of the data model are:
I've the Movie data up a level to normalize the structure, and you could pushed the movie name down into the MovieFact structure for consistency (since for some movies I can imagine even then name is something you may want to track source information for).
Your fictional movie data would then look like the following:
有趣的场景。 您可以通过将您的实体视为一流对象来绕过 EAV 贫民区; 我们称它们为事实。 在这种情况下,你的正交性非常有帮助,因为每部电影都有完全相同的四个事实。 您的 EAV 表可以是原始/正确的表,然后您可以有一个外部进程来挖掘该表并将数据复制为正确规范化的形式(即您的第一个表)。 通过这种方式,您可以获得所需的数据及其元数据,并且您可以轻松查询电影信息,准确到挖掘过程运行的频率。
我认为您肯定需要一些“数据库外”的力量来确保数据保持有效,因为似乎没有任何数据库内的方法来维护常规表和 EAV 表的完整性。 我想通过一系列复杂的触发器,您几乎可以完成任何事情,但是一个“了解”您的问题的人类管理员可能更容易处理。
Interesting scenario. You could get around the EAV ghetto-ness by thinking about your entities as first class objects; let's call them Facts. And it helps that you're pretty orthogonal in this case, in that every movie has the exact same four facts. Your EAV table can be your pristine/correct table, and then you can have an outside process that mines that table and replicates the data into a properly normalized form (i.e. your first table). This way you have the data you want, with its meta data, and, you have an easy way to query for movie information, accurate to how often your mining process runs.
I think you definitely need some "out-of-database" muscle to make sure the data remains valid, since there doesn't seem to be any in-database way of maintaining integrity across your regular and EAV tables. I guess with a complex series of triggers you can pretty much accomplish anything, but one human administrator who "gets" your problem is probably much easier to handle.
这是另一个想法...随意在其中打孔:)
我的想法是电影表中的列可以是不同的类型(在您的示例中,它们都是字符串/ varchar,但它们也可以是也有来源的数字或日期信息)。
然而,源数据的列类型可能不会随着电影数据的列类型的变化而变化,因此您可以对源使用更多的 EAV 系统,而不会丢失数据的完整性。
MovieRole 表允许您显式枚举角色,以便您可以在源和电影表的给定单元格之间创建可靠的链接。
-担
Here's another idea...feel free to punch holes in it :)
What I'm thinking is that the columns in the movie table could be of different types (in your example, they are all strings/varchars, but they could be, say, numerical or date information that also has a source).
The column types for the source data, however, probably wouldn't vary as a function of the column types of the movie data, so you could use more of an EAV system for the source without losing the integrity of your data.
The MovieRole table allows you to explicitly enumerate the roles so you can create a sure linkage between the source and a given cell of the movie table.
-Dan
鉴于您只有源数据的两个字段(源和记者),我建议使用这样的元数据表:
这会将不太重要的源数据保留在主表之外,但查询不会变得复杂,并且你的代码将更具可读性。
我只会建议
EAV
如果...Seeing as you only have the two fields for source data (Source and Journalist), I would recommend a meta-data table like this:
This will keep the less important source data out of the main table, but the queries will not get complicated and your code will be more readable.
I would only advise
EAV
if ...我的回答对于SO来说似乎有点太哲学了。 耐心听我说。
我认为“来源”列不是主题数据,而是元数据。 它实际上是关于我们如何了解其他一些数据的数据。 这使得它成为关于数据的数据,这就是元数据。
EAV 造成这些问题的原因之一是它将数据和元数据混合在一行中。 有时我自己会故意这样做,作为实现我想要实现的结果的中间步骤。 但我尝试永远不要在我的可交付成果中混合数据和元数据。
我知道为什么我从未这样做过,但我无法简明地解释它。
My response may seem a bit too philosophical for SO. Bear with me.
I think that the "Source" column isn't subject matter data, but rather meta-data. It's really data about how we come to know some other bit of data. That makes it data about data, and that's meta-data.
Among the reasons why EAV causes the problems that it does is the fact that it intermixes data and metadata in a single row. There are times when I've deliberately done that myself, as an intermediate step towards a result I want to acheive. But I've tried never to intermix data and metadata in my deliverables.
I know why I never did that, but I can't explain it concisely.
由于没有其他人真正对此进行研究,因此我将回答我自己的问题。 我很确定类似 EAV 的表确实是唯一的出路。 要在每一列上存储元数据(在本例中涉及消息来源和记者),您实际上将每一列本身视为一个实体,这正是 EAV 所允许的。
您可以走其他路线,例如为每个原始列添加第二列和第三列来存储数据,但这肯定会破坏一些基本的标准化规则,并且可能只会在以后给您带来痛苦。
Since no one else is really taking a crack at it, I'm going to answer my own question. I'm pretty sure an EAV-like table is indeed the only way to go. To store metadata on each column (regarding the source and journalist in this case), you're really treating each column as an entity in itself, which is what an EAV allows.
You could go other routes, like adding a second and third column for each original column to store data, but that is definitely breaking some fundamental normalization rules and will probably only cause you pain later.
嗯....我没有使用过这个,所以我不是根据经验说话(即如果它不起作用,请不要责怪我),但从表面上看,您似乎可以存储“通用”数据您知道它永远存在,就像在普通表中一样,并且“元数据”可能会更改为 XML。 那么问题是如何很好地查询它,我认为您可以按照描述来做到这一点 此处。
Hmm.... I've not used this, so I'm not speaking from experience (i.e. don't blame me if it doesn't work), but on the surface it seems that you could store "common" data that you know will always be there as you would in a normal table, and "metadata" that might change as XML. The question then is how to query it nicely, and I think you might be able to do that as described HERE.
另一种需要考虑的方法是类表继承。 Bill Karwin 在 这个 SO 答案 中对 EAV 选项进行了精彩的评论,并且提供了很多很好的上下文。
Another approach to consider is Class Table Inheritance. Bill Karwin has a great review of EAV options in this SO answer, and lots of good context.
我会根据我需要编码的内容做出决定。
如果 src/journo 只是附加信息,我会继续阅读更多专栏。 但如果我知道我最终会构建复杂的 src/journo 查询,我会选择 EAV,因为在元表中搜索记者的参考文献比进入 LeadingFemaleJournalist 更容易em> 和 VillainJournalist 等。就
个人而言 - 我倾向于将 src/journo 元数据转储到另一个 EAV 样式的表中,但使用 FK 来定义属性定义表。 拥有自由格式的属性文本字段会带来灾难 - 始终通过约束来控制属性。 如果需要,可以实施触发器来提高引用完整性。
对我来说,这取决于观点。 您认为消息来源和记者本身就是关系问题,还是只是补充电影的额外数据? 下一个细化级别是为 MovieDataSource 和 MovieDataJournalist 创建不同的表,这可以让您将 FK 映射到定义有效源 的表,并且记者(然后可以充实有关这些来源/记者的更多信息)。 您在这里要做的是在Movie实体和Source(以及Journalist)实体之间建立多对多关系。
I would make my decision based on what I need to code.
If src/journo is simply additional info, I would go for further columns. But if I know I'm going to end up building complicated src/journo queries, I would go EAV, as it'll be easier to search for a journalist's references down the meta table than having to go into LeadingFemaleJournalist and VillainJournalist etc.
Personally - I would be inclined dump the src/journo meta-data into another table EAV-style, but use a FK to define an Attribute definition table. Having a freeform Attribute text field is a recipe for disaster - always control your attributes through a constraint. Triggers could be implemented to improve referential integrity if required.
For me, it comes down to point-of-view. Do you see sources and journalists being relational concerns in their own right or are they just additional pieces of data to complement a Movie? The next level of refinement would be to create different tables for MovieDataSource and MovieDataJournalist which could allow you to map FKs to a tables defining valid Sources and Journalists (and further information on those Sources/Journalists could then be fleshed out). What you will have done here is to establish a many-to-many relationship between the Movie entity and the Source (and also Journalist) entity.