如何在列上存储元数据

发布于 2024-07-25 09:33:17 字数 3227 浏览 11 评论 0原文

假设您正在收集有关即将上映的超级英雄电影的内部信息，并且您的主电影表如下所示：

表 1

Title              Director   Leading Male      Leading Female    Villain
--------------------------------------------------------------------------
Green Lantern      Kubrick    Robert Redford     Miley Cyrus     Hugh Grant  
The Tick          Mel Gibson  Kevin Sorbo        Linda Hunt    Anthony Hopkins

一般而言，该表应该工作得很好，并且允许非常简单的查询以及行之间的比较。

但是，您希望跟踪每个数据事实的来源，以及发现该事实的记者的姓名。这似乎暗示了某种 EAV 表，如下所示：

表2

Movie             Attribute            Value          Source          Journalist
----------------------------------------------------------------------------------
Green Lantern      Director           Kubrick         CHUD              Sarah
Green Lantern    Leading Male      Robert Redford     CHUD              James
Green Lantern   Leading Female      Miley Cyrus    Dark Horizons        James
Green Lantern      Villain           Hugh Grant       CHUD              Sarah
The Tick           Director          Mel Gibson       Yahoo            Cameron
...

虽然它可以轻松捕获我们想要的元数据，但会使查询变得更加困难。简单地获取一部电影的所有基本数据需要更多的时间。更具体地说，您必须在这里处理四行才能获取有关 Green Lantern 的四个重要信息，而在表 1 中它是一个单独的、封装良好的行。

所以我的问题是，鉴于我刚才描述的复杂性，并且因为我知道通常要避免 EAV 表，EAV 仍然是最佳解决方案吗？这似乎是表示这些数据的唯一合理的方式。我看到的唯一其他替代方案是将表 1 与另一个仅包含如下元数据的表结合使用：

表 3

Movie             Attribute            Source          Journalist
----------------------------------------------------------------------------------
Green Lantern      Director             CHUD              Sarah
Green Lantern    Leading Male           CHUD              James
Green Lantern   Leading Female      Dark Horizons         James
Green Lantern      Villain              CHUD              Sarah
The Tick           Director             Yahoo            Cameron
...

但这非常危险，因为如果有人更改表 1 中的列名称，例如“Villain”到“Primary Villain”，表 3 中的行仍然会简单地说“Villain”，因此相关数据将不幸地被解耦。如果“属性”列链接到另一个表，该表充当表 1 的列的枚举，这可能会有所帮助。当然，DBA 将负责维护此枚举表以匹配表 1 的实际列。实际上，可以通过使用 SQL Server 中包含表 1 中列名的系统视图来代替手动创建枚举表来进一步改进这一点。尽管我不确定您是否可以拥有涉及以下内容的关系：系统视图。

你有什么建议？ EAV 是唯一的出路吗？

如果只有一个元数据列（只是“来源”，没有“记者”）怎么办？是否还有必要走 EAV 路线？您可以有“Director”、“Director_Source”、“Leading Male”、“Leading Male_Source”等列，但这很快就会变得丑陋。还有我没有想到的更好的解决方案吗？

如果我没有澄清任何一点，请发表评论，我会根据需要添加更多内容。哦，是的，我使用的电影数据是捏造的:)

编辑：为了简洁地重申我的主要问题，我希望表 1 具有简单性和真正的 RDBMS 设计，它真正很好地描述了电影条目，同时仍然存储以安全且可访问的方式获取有关属性的元数据。这可能吗？或者EAV是唯一的方法吗？

编辑 2：在做了更多的网络研究之后，我还没有找到关于 EAV 的讨论，该讨论集中在将元数据存储在列上的愿望上。实现 EAV 的主要原因几乎总是动态且不可预测的列，但我的示例中并非如此。在我的例子中，总是有相同的四栏：导演、男主角、女主角、反派。但是，我想存储有关每行每列的某些事实（来源和记者）。 EAV 可以促进这一点，但我想避免诉诸于此。

更新

使用表 2 的设计，除了将“Movie”列重命名为“Name”并将整个表命名为“Movie”之外，下面是 SQL Server 2008 中返回表 1 的透视操作：

SELECT Name, [Director], [Leading Male], [Leading Female], [Villain]
FROM (Select Name, Attribute, Value FROM Movie) as src
PIVOT
(
Max(Value)
FOR Attribute IN ([Director], [Leading Male], [Leading Female], [Villain])
)  AS PivotTable

原文

Let's say you're collecting insider info on upcoming superhero movie releases and your main Movie table looks something like this:

Table 1

Title              Director   Leading Male      Leading Female    Villain
--------------------------------------------------------------------------
Green Lantern      Kubrick    Robert Redford     Miley Cyrus     Hugh Grant  
The Tick          Mel Gibson  Kevin Sorbo        Linda Hunt    Anthony Hopkins

This should work very well in general and allow very easy queries as well as comparisons between rows.

However, you'd like to track the source of each data fact, as well as the name of the journalist who discovered the fact. This seems to suggest some sort of an EAV table like this:

Table 2

Movie             Attribute            Value          Source          Journalist
----------------------------------------------------------------------------------
Green Lantern      Director           Kubrick         CHUD              Sarah
Green Lantern    Leading Male      Robert Redford     CHUD              James
Green Lantern   Leading Female      Miley Cyrus    Dark Horizons        James
Green Lantern      Villain           Hugh Grant       CHUD              Sarah
The Tick           Director          Mel Gibson       Yahoo            Cameron
...

Which, while it easily captures the meta-data that we wanted, makes queries harder. It takes a bit more to simply get all the basic data of a single movie. More specifically, you have to deal with four rows here to get the four important tidbits of information on the Green Lantern while in table 1 it is a single, nicely encapsulated row.

So my question is, in light of the complications I just described, and because I know in general EAV tables are to be avoided, is the EAV still the best solution? It does seems like it is the only reasonable way to represent this data. The only other alternative I see is to use table 1 in conjunction with another one that only houses meta data like this:

Table 3

Movie             Attribute            Source          Journalist
----------------------------------------------------------------------------------
Green Lantern      Director             CHUD              Sarah
Green Lantern    Leading Male           CHUD              James
Green Lantern   Leading Female      Dark Horizons         James
Green Lantern      Villain              CHUD              Sarah
The Tick           Director             Yahoo            Cameron
...

But this is very dangerous because if someone changes a column name in table 1, like "Villain" to "Primary Villain," the row in table 3 will still simply say "Villain" and thus the related data will be unfortunately decoupled. This could be helped if the "attribute" column was linked to another table that served as an enumeration of the columns of table 1. Of course, the DBA would be responsible for maintaining this enumeration table to match the actual columns of table 1. And it might actually be possible to improve this even further by instead of creating the enumeration table by hand, use a system view in SQL Server that houses the names of the columns in table 1. Though I'm not sure you can have relationships that involve system views.

What do you suggest? Is the EAV the only way to go?

And what if it was only one meta-data column (just "Source" without "Journalist") - is it still necessary to go the EAV route? You could have columns "Director," "Director_Source," "Leading Male," "Leading Male_Source," etc., but that gets ugly very quickly. Is there some better solution I'm not thinking of?

If I haven't clarified any point please comment and I'll add more as necessary. Oh yeah, and the movie data I used is fabricated :)

Edit: To restate my primary question concisely, I would like to have the simplicity and the true RDBMS design of table 1, which really describes a movie entry well, while still storing the meta data on the attributes in a safe and accessible manner. Is this possible? Or is EAV the only way?

Edit 2: After doing some more web research, I have yet to find a discussion on EAV's that centered around the desire to store metadata on the columns. The primary reason given to implement an EAV is almost always dynamic and unpredictable columns, which is not the case in my example. In my example, There are always the same four columns: director, leading male, leading female, villain. However, I want to store certain facts (source and journalist) about each column for each row. An EAV would facilitate this, but I would like to avoid resorting to that.

Update

Using the Table 2 design except for renaming the column "Movie" to "Name" and calling the whole table "Movie," here is the pivot operation in SQL Server 2008 to get back Table 1:

SELECT Name, [Director], [Leading Male], [Leading Female], [Villain]
FROM (Select Name, Attribute, Value FROM Movie) as src
PIVOT
(
Max(Value)
FOR Attribute IN ([Director], [Leading Male], [Leading Female], [Villain])
)  AS PivotTable

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

无所谓啦 2024-08-01 09:33:17

您可以更改设计中您认为的事实值 ...数据模型中的事实似乎可以表示为以下 N 元组：

Movie | FactType | FactValue | FactSource | FactJournalist

以下表结构应支持该数据模型你想要的，并且可以相对容易地被索引和加入。您还可以创建一个仅导出事实值和事实类型的视图，以便您可以创建以下视角：

MovieID | Movie Name | Director | LeadingMale | LeadingFemale | PrimaryVillain | etc

有趣的是，您可以认为这是将 EAV 模型完全应用于数据并分解个体的逻辑扩展电影（具有导演、主角、反派等的直观属性）转变为一个旋转结构，其中属性集中于信息源。

所提出的数据模型的优点是：

它是良好标准化的（尽管为了完整性，您可能应该将 FactType 字段标准化为参考表）
可以创建一个有效地将事实类型转出的视图成表格结构，
它相对可扩展，并允许数据库强制引用完整性和（如果需要）基数约束
MovieFact 表可以进行子类化以支持不同类型的电影事实，而不仅仅是那些简单的文本字段
对数据的简单查询相对高效

该数据模型的一些缺点是：

复合条件查询更难（但并非不可能）编写（例如查找导演为 A、男主角为 B 的所有电影等）。 ..）
该模型比更传统的方法不太明显，或者涉及 EAV 结构
插入和更新的模型有点棘手，因为更新多个事实需要更新多行，而不是多列

我已将电影数据提升了一个级别以进行标准化结构，并且您可以将电影名称下推到 MovieFact 结构中以保持一致性（因为对于某些电影，我可以想象即使名称是您可能想要跟踪源信息的内容）。

Table Movie
========================
MovieID   NUMBER, PrimaryKey
MovieName VARCHAR

Table MovieFact
========================
MovieID          NUMBER,  PrimaryKeyCol1
FactType         VARCHAR, PrimaryKeyCol2
FactValue        VARCHAR
FactSource       VARCHAR
FactJournalist   VARCHAR

您的虚构电影数据将如下所示：

Movie Table
====================================================================================
MovieID  MovieName
====================================================================================
1        Green Lantern
2        The Tick

MovieFact Table
====================================================================================
MovieID  FactType       FactValue         FactSource       FactJournalist
====================================================================================
1        Director       Kubrick           CHUD             Sarah
1        Leading Male   Robert Redford    CHUD             James
1        Leading Female Miley Cyrus       Dark Horizons    James
1        Villain        Hugh Grant        CHUD             Sarah
2        Director       Mel Gibson        Yahoo            Cameron
2        Leading Male   John Lambert      Yahoo            Erica
...

Your can change what you consider a fact value in your design ... it seems that a fact in your data model could be expressed as the following N-tuple:

Movie | FactType | FactValue | FactSource | FactJournalist

The following table structures should support the data model you want, and can relatively easily be indexed and joined. You can also create a view that pivots out just the fact value and fact type so that you can create the following perspective:

MovieID | Movie Name | Director | LeadingMale | LeadingFemale | PrimaryVillain | etc

Interestingly, you could consider this to be the logical extension of fully applying an EAV model to the data, and decomposing an individual movie (with it's intuitive attribution of director, lead, villain, etc) into a pivoted structure where attributes focus on the source of the information instead.

The benefits of the proposed data model are:

it is well-normalized (though you should probably normalize the FactType field into a reference table for completeness)
it is possible to create a view that pivots fact types efficiently out into a tabular structure
it is relatively extensible and allows the database to enforce referential integrity and (if desired) cardinality constraints
the MovieFact table can be subclassed to support different kinds of movie facts, not just those that are simple text field
simple queries against the data are relatively efficient

Some of the disadvantages of the data model are:

Composite, conditional queries are harder (but not impossible) to write (e.g. find all movies where Director is A and Leading Male is B, etc...)
The model is somewhat less obvious than the more traditional approach, or one involving EAV structures
inserts and updates are a little trickier because updating multiple facts requires updating multiple rows, not multiple columns

I've the Movie data up a level to normalize the structure, and you could pushed the movie name down into the MovieFact structure for consistency (since for some movies I can imagine even then name is something you may want to track source information for).

Table Movie
========================
MovieID   NUMBER, PrimaryKey
MovieName VARCHAR

Table MovieFact
========================
MovieID          NUMBER,  PrimaryKeyCol1
FactType         VARCHAR, PrimaryKeyCol2
FactValue        VARCHAR
FactSource       VARCHAR
FactJournalist   VARCHAR

Your fictional movie data would then look like the following:

Movie Table
====================================================================================
MovieID  MovieName
====================================================================================
1        Green Lantern
2        The Tick

MovieFact Table
====================================================================================
MovieID  FactType       FactValue         FactSource       FactJournalist
====================================================================================
1        Director       Kubrick           CHUD             Sarah
1        Leading Male   Robert Redford    CHUD             James
1        Leading Female Miley Cyrus       Dark Horizons    James
1        Villain        Hugh Grant        CHUD             Sarah
2        Director       Mel Gibson        Yahoo            Cameron
2        Leading Male   John Lambert      Yahoo            Erica
...

回复收藏 0 原文

野味少女 2024-08-01 09:33:17

有趣的场景。您可以通过将您的实体视为一流对象来绕过 EAV 贫民区；我们称它们为事实。在这种情况下，你的正交性非常有帮助，因为每部电影都有完全相同的四个事实。您的 EAV 表可以是原始/正确的表，然后您可以有一个外部进程来挖掘该表并将数据复制为正确规范化的形式（即您的第一个表）。通过这种方式，您可以获得所需的数据及其元数据，并且您可以轻松查询电影信息，准确到挖掘过程运行的频率。

我认为您肯定需要一些“数据库外”的力量来确保数据保持有效，因为似乎没有任何数据库内的方法来维护常规表和 EAV 表的完整性。我想通过一系列复杂的触发器，您几乎可以完成任何事情，但是一个“了解”您的问题的人类管理员可能更容易处理。

回复收藏 0 原文

梦回旧景 2024-08-01 09:33:17

这是另一个想法...随意在其中打孔:)

Table: Movie
Columns: MovieId|Movie|Director|LeadMale|LeadFemale|Villain

Table: MovieSource
Columns: MovieSourceId|MovieId|MovieRoleId|Source|Journalist

Table: MovieRole
Columns: MovieRoleId|MovieRole
Values: 1|Director, 2|LeadMale, 3|LeadFemale, 4|Villain

我的想法是电影表中的列可以是不同的类型（在您的示例中，它们都是字符串/ varchar，但它们也可以是也有来源的数字或日期信息）。

然而，源数据的列类型可能不会随着电影数据的列类型的变化而变化，因此您可以对源使用更多的 EAV 系统，而不会丢失数据的完整性。

MovieRole 表允许您显式枚举角色，以便您可以在源和电影表的给定单元格之间创建可靠的链接。

-担

Here's another idea...feel free to punch holes in it :)

Table: Movie
Columns: MovieId|Movie|Director|LeadMale|LeadFemale|Villain

Table: MovieSource
Columns: MovieSourceId|MovieId|MovieRoleId|Source|Journalist

Table: MovieRole
Columns: MovieRoleId|MovieRole
Values: 1|Director, 2|LeadMale, 3|LeadFemale, 4|Villain

What I'm thinking is that the columns in the movie table could be of different types (in your example, they are all strings/varchars, but they could be, say, numerical or date information that also has a source).

The column types for the source data, however, probably wouldn't vary as a function of the column types of the movie data, so you could use more of an EAV system for the source without losing the integrity of your data.

The MovieRole table allows you to explicitly enumerate the roles so you can create a sure linkage between the source and a given cell of the movie table.

-Dan

回复收藏 0 原文

顾北清歌寒 2024-08-01 09:33:17

鉴于您只有源数据的两个字段（源和记者），我建议使用这样的元数据表：

Movie    DirectorSource  DirectorJournalist  LeadingMaleSource  LeadingMaleJournalist ...
---------------------------------------------------------------------------------------
The Tick   Yahoo           Cameron           ...                ...

这会将不太重要的源数据保留在主表之外，但查询不会变得复杂，并且你的代码将更具可读性。

我只会建议 EAV 如果...

您有超过 3 个源元数据字段
您需要能够轻松添加或更改电影字段。（像“恶棍”到“主要恶棍”这样的变化每天会进行多次）

Seeing as you only have the two fields for source data (Source and Journalist), I would recommend a meta-data table like this:

Movie    DirectorSource  DirectorJournalist  LeadingMaleSource  LeadingMaleJournalist ...
---------------------------------------------------------------------------------------
The Tick   Yahoo           Cameron           ...                ...

This will keep the less important source data out of the main table, but the queries will not get complicated and your code will be more readable.

I would only advise EAV if ...

You have more than 3 fields of source meta-data
You need to be able to add or change movie fields easily. (changes like 'Villain' to 'Primary Villain' are being done several times per day)

回复收藏 0 原文

情深已缘浅 2024-08-01 09:33:17

我的回答对于SO来说似乎有点太哲学了。耐心听我说。

我认为“来源”列不是主题数据，而是元数据。它实际上是关于我们如何了解其他一些数据的数据。这使得它成为关于数据的数据，这就是元数据。

EAV 造成这些问题的原因之一是它将数据和元数据混合在一行中。有时我自己会故意这样做，作为实现我想要实现的结果的中间步骤。但我尝试永远不要在我的可交付成果中混合数据和元数据。

我知道为什么我从未这样做过，但我无法简明地解释它。

回复收藏 0 原文

善良天后 2024-08-01 09:33:17

由于没有其他人真正对此进行研究，因此我将回答我自己的问题。我很确定类似 EAV 的表确实是唯一的出路。要在每一列上存储元数据（在本例中涉及消息来源和记者），您实际上将每一列本身视为一个实体，这正是 EAV 所允许的。

您可以走其他路线，例如为每个原始列添加第二列和第三列来存储数据，但这肯定会破坏一些基本的标准化规则，并且可能只会在以后给您带来痛苦。

回复收藏 0 原文

べ映画 2024-08-01 09:33:17

嗯....我没有使用过这个，所以我不是根据经验说话（即如果它不起作用，请不要责怪我），但从表面上看，您似乎可以存储“通用”数据您知道它永远存在，就像在普通表中一样，并且“元数据”可能会更改为 XML。那么问题是如何很好地查询它，我认为您可以按照描述来做到这一点此处。

回复收藏 0 原文

左秋 2024-08-01 09:33:17

另一种需要考虑的方法是类表继承。 Bill Karwin 在这个 SO 答案中对 EAV 选项进行了精彩的评论，并且提供了很多很好的上下文。

回复收藏 0 原文

千笙结 2024-08-01 09:33:17

我会根据我需要编码的内容做出决定。

如果 src/journo 只是附加信息，我会继续阅读更多专栏。但如果我知道我最终会构建复杂的 src/journo 查询，我会选择 EAV，因为在元表中搜索记者的参考文献比进入 LeadingFemaleJournalist 更容易em> 和 VillainJournalist 等。就

个人而言 - 我倾向于将 src/journo 元数据转储到另一个 EAV 样式的表中，但使用 FK 来定义属性定义表。拥有自由格式的属性文本字段会带来灾难 - 始终通过约束来控制属性。如果需要，可以实施触发器来提高引用完整性。

对我来说，这取决于观点。您认为消息来源和记者本身就是关系问题，还是只是补充电影的额外数据？下一个细化级别是为 MovieDataSource 和 MovieDataJournalist 创建不同的表，这可以让您将 FK 映射到定义有效源的表，并且记者（然后可以充实有关这些来源/记者的更多信息）。您在这里要做的是在Movie实体和Source（以及Journalist）实体之间建立多对多关系。