Multi-valued attributes in a relational database?
How good an idea are multi-valued attributes in a relational database when they have to be referenced extensively?
Let me give you an example to show what I mean. Suppose I have the following table:
UserID Attribute1
User1 a,b,c
User2 x,y,z
User3 a,x,y
User4 c,b,z
[a, b, c, x, y, z are strings]
There is another user, User5, to whom I have to make suggestions about other users based on whether his Attribute1 matches that of any of the other 4 users.
[In graph databases, the task could have been much easier as I could have created multiple nodes from the respective users using the same relationship.]
Now, this table is just a micro-level abstraction of what an actual database will look like. The number of rows in a table may run into hundreds of thousands, if not millions. Also, the multiple values may actually be a lot more than 3. Apart from this, the database can be under heavy load, and in that situation, there may be some issues.
So, are multi-valued attributes helpful in such cases? Or is there any better way of doing the same? One obvious way I can think of is to store it as:
UserID Attribute1
User1 a
User1 b
User1 c
User2 x
User2 y
User2 z
User3 a
User3 x
User3 y
User4 c
User4 b
User4 z
Is there any faster way of dealing with such situations in databases? Or are there any built-in features of modern-day databases to exploit?
Comments (3)
Having multiple values in a field is only useful if the data is dead weight in the database, i.e. if you only read the field out of the database and process it afterwards.
As soon as you want to use the values in the field in a query, you will take a huge performance hit from having to parse the value to compare it. If you put the values in separate records as in your second example, so that you can add an index on it, it's not unrealistic that the query will be 10 000 times faster.
Having a million records in a table is not a problem. We have some tables that have over 100 million records in them.
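The one-value-per-row layout with an index can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not a benchmark; the table name user_attribute, the index name, and User5's attribute values ("a" and "x") are assumptions made up for the example, since the question does not state them.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory table holding one (user, attribute) pair per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_attribute ("
        " user_id TEXT NOT NULL,"
        " attribute TEXT NOT NULL,"
        " PRIMARY KEY (user_id, attribute))"
    )
    # Index on the attribute column so matching by value can avoid a full scan.
    conn.execute("CREATE INDEX idx_attribute ON user_attribute (attribute)")
    conn.executemany("INSERT INTO user_attribute VALUES (?, ?)", rows)
    return conn


def users_sharing_attribute(conn, user_id):
    """Return (other_user, shared_count) for users with at least one common attribute."""
    sql = """
        SELECT other.user_id, COUNT(*) AS shared
        FROM user_attribute AS mine
        JOIN user_attribute AS other
          ON other.attribute = mine.attribute
         AND other.user_id <> mine.user_id
        WHERE mine.user_id = ?
        GROUP BY other.user_id
        ORDER BY shared DESC, other.user_id
    """
    return conn.execute(sql, (user_id,)).fetchall()


ROWS = [
    ("User1", "a"), ("User1", "b"), ("User1", "c"),
    ("User2", "x"), ("User2", "y"), ("User2", "z"),
    ("User3", "a"), ("User3", "x"), ("User3", "y"),
    ("User4", "c"), ("User4", "b"), ("User4", "z"),
    ("User5", "a"), ("User5", "x"),  # hypothetical values, not given in the question
]

conn = build_db(ROWS)
matches = users_sharing_attribute(conn, "User5")
print(matches)
```

With the comma-separated layout, the same match would need string parsing or a LIKE pattern on every row, which an ordinary index cannot help with.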
Apart from what the others have said regarding normalization, I'd like to answer the "Or are there any built-in features of modern-day databases to exploit?" part of your question:
PostgreSQL has a pretty nifty extension called hstore which does exactly that, and in a highly optimized manner.
The hstore data type is essentially a set of key/value pairs, where you can store anything. In your example, an insert would put keys such as att1 and att2 into a column named attributes. The column can be indexed to make lookups fast.
You can then query for all rows that have a key named att1 mapped to the value "Some Value". Such a query will use an existing index on the column, so the lookup is nearly as fast as with a "real" column. It takes ~2 ms on my laptop to find a row in a table with 100,000 rows.
You can also query for rows that have a specific attribute defined regardless of its value: this finds all rows where att1 is defined and outputs the value for those.
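As a rough sketch of the hstore approach described above, the statements below use hstore's documented text input format and its `@>` (contains), `?` (key exists) and `->` (get value) operators. The table name user_props, the index name, and the helper functions are hypothetical, and the answer's own statements may have differed. Running run_demo requires a PostgreSQL server and a DB-API driver such as psycopg2, which is an assumption here and is not imported by the block itself.

```python
# SQL for PostgreSQL with the hstore extension.
# Table, column and index names are hypothetical, chosen for illustration.
SETUP_SQL = (
    "CREATE EXTENSION IF NOT EXISTS hstore",
    "CREATE TABLE IF NOT EXISTS user_props ("
    " user_id text PRIMARY KEY, attributes hstore)",
    # A GIN index supports the @> and ? operators used below.
    "CREATE INDEX IF NOT EXISTS user_props_attrs_idx"
    " ON user_props USING gin (attributes)",
)
INSERT_SQL = (
    "INSERT INTO user_props (user_id, attributes) VALUES (%s, %s::hstore)"
    " ON CONFLICT (user_id) DO NOTHING"
)
# Rows whose attributes contain the given key/value pair.
BY_VALUE_SQL = "SELECT user_id FROM user_props WHERE attributes @> %s::hstore"
# Rows that define att1 at all, together with its value.
BY_KEY_SQL = (
    "SELECT user_id, attributes -> 'att1' FROM user_props"
    " WHERE attributes ? 'att1'"
)


def hstore_literal(pairs):
    """Render a dict of strings in hstore text input form: "key"=>"value", ..."""
    def quote(text):
        # Backslashes and double quotes must be backslash-escaped inside quotes.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return ", ".join(f"{quote(k)}=>{quote(v)}" for k, v in pairs.items())


def run_demo(conn):
    """Run the statements on an open connection (e.g. from psycopg2.connect)."""
    with conn.cursor() as cur:
        for statement in SETUP_SQL:
            cur.execute(statement)
        cur.execute(
            INSERT_SQL,
            ("User1", hstore_literal({"att1": "Some Value", "att2": "other"})),
        )
        cur.execute(BY_VALUE_SQL, (hstore_literal({"att1": "Some Value"}),))
        by_value = cur.fetchall()
        cur.execute(BY_KEY_SQL)
        by_key = cur.fetchall()
    conn.commit()
    return by_value, by_key
```

Note that this stores key/value pairs per row; for the question's plain list of values per user, the separate-rows layout is usually the more natural fit.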
For an n-to-n relationship you could normalize it into 3 tables (in a transactional model): users, user_attribute and attributes, where the user_attribute table consists of the primary keys of users and attributes. Keys are usually indexed and are therefore quite fast for read operations.
EDIT AFTER QUESTION
This would result in a table holding only the users, a table holding only the attributes, and a table recording which user holds which attribute,
for instance
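One way such a three-table layout could look is sketched below with Python's built-in sqlite3 module. The column names (id, name, value), the surrogate integer keys, and the helper functions are assumptions for illustration; the answer only names the three tables.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE attributes (
    id    INTEGER PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);
-- Junction table: one row per (user, attribute) pair.
CREATE TABLE user_attribute (
    user_id      INTEGER NOT NULL REFERENCES users (id),
    attribute_id INTEGER NOT NULL REFERENCES attributes (id),
    PRIMARY KEY (user_id, attribute_id)
);
-- The primary key covers lookups by user; this index covers lookups by attribute.
CREATE INDEX idx_user_attribute_attr ON user_attribute (attribute_id);
"""


def load(conn, data):
    """Insert users, their attribute values, and the links between them."""
    for name, values in data.items():
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        for value in values:
            # Each distinct attribute value is stored only once.
            conn.execute(
                "INSERT OR IGNORE INTO attributes (value) VALUES (?)", (value,)
            )
            conn.execute(
                "INSERT INTO user_attribute (user_id, attribute_id) "
                "SELECT u.id, a.id FROM users u, attributes a "
                "WHERE u.name = ? AND a.value = ?",
                (name, value),
            )


def users_with(conn, value):
    """Return the names of all users holding the given attribute value."""
    sql = """
        SELECT u.name
        FROM users u
        JOIN user_attribute ua ON ua.user_id = u.id
        JOIN attributes a ON a.id = ua.attribute_id
        WHERE a.value = ?
        ORDER BY u.name
    """
    return [row[0] for row in conn.execute(sql, (value,))]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load(conn, {
    "User1": ["a", "b", "c"],
    "User2": ["x", "y", "z"],
    "User3": ["a", "x", "y"],
    "User4": ["c", "b", "z"],
})
holders_of_a = users_with(conn, "a")
attribute_count = conn.execute("SELECT COUNT(*) FROM attributes").fetchone()[0]
print(holders_of_a, attribute_count)
```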