Is it faster to normalize this table?

Posted 2024-07-23 18:34:12 | 568 words | 3 views | 0 comments

I have two tables, like these:

Table People:
VARCHAR Name
INTEGER Age

Table Message
VARCHAR Message
VARCHAR Name

There are hundreds of inserts and deletes going on against the Message table, with queries like this:

insert into Message VALUES ('Hello there', 'John');
delete from Message where name = 'John';

My question is: is it worthwhile to add an ID field in People and reference the User by id in Message? Will the following queries be much faster?

FIRST - select id from User where name = 'John'
THEN - delete from Message where id = $id

I thought at first that inserts and deletes that search by id would be faster than ones that search by characters. But how much faster? Will the speed increase offset the performance hit from needing an extra query against the User table to find the id?

Comments (11)

满栀 2024-07-30 18:34:13

It is all about I/O and maintainability. If your varchar values are shorter than 4 bytes, then your query will be faster with the varchar than with the integer. But it is not a big improvement, and you lose a lot of performance if you ever need to change a name: every row in the Message table that references it needs to be updated (e.g. deleted and re-inserted).

If you use an integer, only 4 bytes are stored for the reference to the user table. And if you use a covering index on ID and name in the user table, the extra lookup is just a slight overhead. The user table will probably stay in the cache, depending on how many rows you expect in it and how much memory you have. In that case you will have a logical (in-memory) scan instead of a much slower physical (disk) scan.
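
To make the covering-index point concrete, here is a minimal sketch. It assumes SQLite through Python's sqlite3 module, and the table, column, and index names (people, message, idx_people_name_id) are made up for illustration; other databases report plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER);
    CREATE TABLE message (body TEXT NOT NULL,
                          person_id INTEGER NOT NULL REFERENCES people(id));
    -- Hypothetical index: it holds both columns the name -> id lookup needs.
    CREATE INDEX idx_people_name_id ON people (name, id);
    INSERT INTO people (id, name, age) VALUES (1, 'John', 30), (2, 'Mary', 41);
""")

# Ask SQLite how it would resolve the name to an id.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM people WHERE name = ?", ("John",)
).fetchall()
plan_text = " | ".join(row[-1] for row in plan)  # last column is the plan detail
print(plan_text)
```

If the plan mentions a covering index, the lookup is answered from the index alone and the people table rows are never read.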

给我一枪 2024-07-30 18:34:13

A side note: don't forget to put indexes on the columns that you are joining your tables on, if they aren't there already.

撩起发的微风 2024-07-30 18:34:12

As you said, the extra query will make it SLIGHTLY slower (of course, that depends on the name's length, the database type, etc.)

But what happens when a user changes his name, when you want to delete a user, etc.? That sort of design will give you lots of pain. It's better to normalize, regardless of this tiny performance issue.
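
The rename pain can be seen in a small sketch. This assumes SQLite via Python's sqlite3; the two message tables and the 1000 sample rows are invented purely for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Design A: messages reference the person by name.
    CREATE TABLE message_by_name (body TEXT, name TEXT);
    -- Design B: messages reference a surrogate id.
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE message_by_id (body TEXT,
                                person_id INTEGER REFERENCES people(id));
    INSERT INTO people (id, name) VALUES (1, 'John');
""")
rows = [("msg %d" % i,) for i in range(1000)]
conn.executemany("INSERT INTO message_by_name VALUES (?, 'John')", rows)
conn.executemany("INSERT INTO message_by_id VALUES (?, 1)", rows)

# Renaming John touches every one of his message rows in design A ...
touched_a = conn.execute(
    "UPDATE message_by_name SET name = 'Jon' WHERE name = 'John'"
).rowcount
# ... but only the single people row in design B.
touched_b = conn.execute(
    "UPDATE people SET name = 'Jon' WHERE id = 1"
).rowcount
print(touched_a, touched_b)  # 1000 1
```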

倚栏听风 2024-07-30 18:34:12

A person's name is never a good primary key, as names are NOT unique, and they change over time. It is far better to use a surrogate key, especially since names tend to be longer than a few characters. And yes, a join on an int is often faster, and in many databases you can join in delete statements instead of using the slower subquery.

岁月流歌 2024-07-30 18:34:12

Your design is already normalized, assuming you have a unique constraint on People.Name, and there is a referential integrity constraint between Message.Name and People.Name.

It's not so much a problem of normalization. IF you want to allow people to change their names, you have a performance and scalability issue (needing to update all associated rows in the Message table).

If people never change their names in your system, then it's not a problem. In this case, Name is pretty much as good as an ID, although some DBMSs might perform better with an indexed number rather than an indexed string (?).

The performance of the delete is another matter - I'd say if you have a unique name already, then deleting by name will be quicker than having to do a lookup (or join) to delete by ID - but again you'll want to do your own benchmarking.
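
For reference, the name-keyed design with the two constraints mentioned at the start of this answer could look roughly like this. It is a sketch assuming SQLite via Python's sqlite3 (which only enforces foreign keys after the PRAGMA shown); the body column name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE people (name TEXT NOT NULL UNIQUE, age INTEGER);
    CREATE TABLE message (
        body TEXT NOT NULL,
        name TEXT NOT NULL REFERENCES people(name)
    );
    INSERT INTO people VALUES ('John', 30);
    INSERT INTO message VALUES ('Hello there', 'John');
""")

# A message for an unknown name is rejected by the referential constraint.
try:
    conn.execute("INSERT INTO message VALUES ('Hi', 'Nobody')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting by name needs no lookup or join at all.
deleted = conn.execute("DELETE FROM message WHERE name = 'John'").rowcount
print(rejected, deleted)  # True 1
```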

萌逼全场 2024-07-30 18:34:12

Is it faster? Only profiling will tell, however . . .

It IS better practise to put an id column on Person, and put a foreign key constraint from Message to Person on Id (that is assuming all messages can only go to people in Person table).

You can still delete a message in one statement:

delete from Message where id IN (select id from Person where Name = 'John')

and the database will optimise this, so it is far faster than running separate select and delete statements.

You can specify cascading deletes on your foreign key constraint, so that when you delete a Person, all messages to the person are automatically deleted too.

See here for more on Foreign Keys
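
Both ideas from this answer (the one-statement delete and the cascading foreign key) can be tried out in a small sketch. It assumes SQLite via Python's sqlite3; the person_id and body column names and the sample rows are illustrative, not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for the cascade in SQLite
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY,
                         name TEXT NOT NULL UNIQUE, age INTEGER);
    CREATE TABLE message (
        body TEXT NOT NULL,
        person_id INTEGER NOT NULL REFERENCES person(id) ON DELETE CASCADE
    );
    INSERT INTO person (id, name, age) VALUES (1, 'John', 30), (2, 'Mary', 41);
    INSERT INTO message VALUES ('Hello there', 1), ('Bye', 1), ('Hi Mary', 2);
""")

# One statement: the subquery resolves the name to an id.
deleted = conn.execute(
    "DELETE FROM message WHERE person_id IN "
    "(SELECT id FROM person WHERE name = ?)",
    ("John",),
).rowcount

# Deleting a person removes that person's messages through the cascade.
conn.execute("DELETE FROM person WHERE name = ?", ("Mary",))
remaining = conn.execute("SELECT COUNT(*) FROM message").fetchone()[0]
print(deleted, remaining)  # 2 0
```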

那片花海 2024-07-30 18:34:12

If most of the names are short (not 15 to 20 characters long) and the table is indexed properly, then the speed gain you will get from an id field will be negligible.

呆萌少年 2024-07-30 18:34:12

You shouldn't need to do an extra query. You can do something like this:

-- MySQL-style multi-table DELETE; the exact syntax varies by database
DELETE Message FROM Message
INNER JOIN User
  ON Message.id = User.id
WHERE User.name = 'John'

绝不服输 2024-07-30 18:34:12

In my experience, the user table of a website backend is one of those tables that pretty much stays in memory 100% of the time. It is pretty central to any activity, so it never falls out of the page buffers. So I would definitely go (and have gone) the route of using the userId for all references like this.

岁月苍老的讽刺 2024-07-30 18:34:12

For your exact question: with such a small schema, dumping the contents of the original Messages table will be faster with the denormalized design. The query plan will be smaller and easier to optimize, and there will be no join overhead.

In general, it's much, much more complicated.

Whether it's the right thing to do is a different question. For that, start with a normalized design but be willing and prepared to denormalize if there's a compelling reason to do so. There are sometimes legit reasons for denormalization, though usually the gains of normalized data offset any performance loss.

Normalized data is easier to maintain and is generally more flexible. For flexibility, having a numeric pkey lets you have multiple people named the same name. You can add more fields to People easily. It's easier to run a report to see all the people in the system without scanning all Messages.

But performance may be a factor. Given the data in the two tables, the database has several options on how to join. It may use either People or Messages as the base table, and how the join is done will affect things (nested loops, hash joins, sort/merge, etc).

But on top of that, normalized can actually be faster. What if your schema is more complicated than you describe? Let's say your People table has 50 fields of HR-related stuff and your Messages table has only a single 20-character message field. If you have a case of two people but 100k messages, normalized will actually be faster. This is because I/O is the biggest limiting factor of databases. If you were to dump all data in one query, the normalized design fetches those 50 fields only once per person, and your Messages table is densely packed with data. In the denormalized version, each row of Messages contains 51 fields, and you drastically increase the number of I/Os needed to get the same result.
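
As a back-of-the-envelope check of that scenario, here is the arithmetic spelled out. The 20 bytes per field and the 4-byte integer key are assumptions made only for this estimate; real row sizes depend on the DBMS and its storage overhead.

```python
# Assumed sizes (not from the thread): every field is 20 bytes, keys are 4 bytes.
FIELD_BYTES = 20
KEY_BYTES = 4
PEOPLE = 2
MESSAGES = 100_000
PEOPLE_FIELDS = 50

# Normalized: the 50 HR fields are stored once per person;
# each message row holds its text plus a small key.
normalized = (PEOPLE * (PEOPLE_FIELDS * FIELD_BYTES + KEY_BYTES)
              + MESSAGES * (FIELD_BYTES + KEY_BYTES))

# Denormalized: every message row repeats all 50 fields next to its text.
denormalized = MESSAGES * (PEOPLE_FIELDS + 1) * FIELD_BYTES

print(normalized)    # 2402008
print(denormalized)  # 102000000
print(round(denormalized / normalized, 1))
```

Under these assumptions the denormalized dump has to read roughly forty times as many bytes.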

只是一片海 2024-07-30 18:34:12

You shouldn't need to worry about optimization until you've got a good design.

Right now, I think it's hard to say if this is an unrealistic toy problem, but typically, you are going to design with a primary key and foreign key relationship in mind, and this key is very unlikely to be a varchar.

It might be a GUID or it might be an int, but either way, you're going to have at least a non-clustered index on it, perhaps a clustered index, and you'll have a non-clustered index on important data like the username. Ultimately, the performance of your entire system (and not just this one delete) is going to depend on a solid normalized design and a good indexing strategy for these tables.
