B-trees, databases, sequential vs. random inserts, and speed. Random is winning
EDIT
@Remus corrected my test pattern. You can see the corrected version in his answer below.
I took the suggestion of replacing the INT with DECIMAL(29,0) and the results were:
Decimal: 2133
GUID: 1836
Random inserts are still winning, even with a slightly bigger row.
Despite explanations that random inserts should be slower than sequential ones, these benchmarks show them to be faster. The explanations I am getting don't agree with the benchmarks, so my question remains focused on B-trees, sequential inserts, and speed.
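For reference, a sketch of what the DECIMAL(29,0) variant of the table could look like (the edit doesn't show the actual script, so the table and column names here are assumptions):

-- Hypothetical DECIMAL(29,0) variant of T1; a precision-29 DECIMAL is stored
-- in 17 bytes, so the key width is comparable to the 16-byte uniqueidentifier.
CREATE TABLE [dbo].[T1d](
[ID] [decimal](29,0) NOT NULL,
[Txt] [char](300) NULL,
CONSTRAINT [T1d_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)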
...
I know from experience that B-trees have awful performance when data is added to them sequentially (regardless of the direction). However, when data is added randomly, the best performance is obtained.
This is easy to demonstrate with the likes of an RB-Tree. Sequential writes cause a maximum number of tree balances to be performed.
I know very few databases use binary trees, but rather use n-order balanced trees. I logically assume they suffer a similar fate to binary trees when it comes to sequential inputs.
This sparked my curiosity.
If this is so, then one could deduce that writing sequential IDs (such as in IDENTITY(1,1)) would cause multiple re-balances of the tree to occur. I have seen many posts argue against GUIDs as "these will cause random writes". I never use GUIDs, but it struck me that this "bad" point was in fact a good point.
So I decided to test it. Here is my code:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[T1](
[ID] [int] NOT NULL,
[Txt] [char](300) NULL, -- filler column (name assumed) so the two-value INSERTs below are valid
CONSTRAINT [T1_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO
CREATE TABLE [dbo].[T2](
[ID] [uniqueidentifier] NOT NULL,
[Txt] [char](300) NULL, -- filler column (name assumed), as in T1
CONSTRAINT [T2_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO
declare @i int, @t1 datetime, @t2 datetime, @t3 datetime, @c char(300)
-- @c is never assigned, so NULL is inserted into the filler column; char(300)
-- is fixed-length, so the row carries the extra width either way
set @t1 = GETDATE()
set @i = 1
-- sequential inserts: rows keyed on the incrementing int
while @i < 2000 begin
    insert into T1 values (@i, @c)
    set @i = @i + 1
end
set @t2 = GETDATE()
WAITFOR delay '0:0:10' -- pause so the two runs don't overlap
set @t3 = GETDATE()
set @i = 1
-- random inserts: rows keyed on NEWID()
while @i < 2000 begin
    insert into T2 values (NEWID(), @c)
    set @i = @i + 1
end
select DATEDIFF(ms, @t1, @t2) AS [Int], DATEDIFF(ms, @t3, getdate()) AS [GUID]
drop table T1
drop table T2
Note that I am not subtracting any time for the creation of the GUIDs, nor for the considerably larger row size. The results on my machine were as follows:
Int: 17,340 ms
GUID: 6,746 ms
This means that in this test, random inserts of a 16-byte key were almost 3 times faster than sequential inserts of a 4-byte key.
Would anyone like to comment on this?
PS. I get that this isn't a question. It's an invitation to discussion, and that is relevant to learning optimal programming.
Comments (3)
Flip the operation and the int is faster. Have you taken into account log and data file growth? Run each separately.
The problem with UUIDs, when clustering on them without using NEWSEQUENTIALID(), is that they cause page splits and fragmentation of the table.
Now try it like this (see the sketch below) and you'll see it is almost the same.
And reversed.
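A minimal sketch of that variant (the table name T3 is made up, not from the comment). NEWSEQUENTIALID() may only appear in a DEFAULT constraint, and because its values are ever-increasing, inserts land at the end of the clustered index just like the int case:

-- Hypothetical T3: clustered on a GUID generated by NEWSEQUENTIALID(),
-- which is ever-increasing, so inserts append instead of splitting pages.
CREATE TABLE [dbo].[T3](
[ID] [uniqueidentifier] NOT NULL DEFAULT NEWSEQUENTIALID(),
[Txt] [char](300) NULL,
CONSTRAINT [T3_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO
-- each row picks up the next sequential GUID from the default
INSERT INTO T3 ([Txt]) VALUES (NULL)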
You are not measuring the INSERT speed. You are measuring your log flush performance. Since you commit after each INSERT, all these tests are doing is sitting around waiting for the commit to harden the log. That is hardly relevant for INSERT performance. And please don't post 'performance' measurements when SET NOCOUNT is OFF... So let's try this without the unnecessary server-client chatter, with properly sized data, batch commits, and pre-grown databases (a sketch follows below):
INTS: 18s
GUIDS: 23s
QED
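His corrected script isn't reproduced here; the following is only a sketch of the shape he describes (NOCOUNT on, one commit for the whole batch, tables as above; the single-transaction batch size is an assumption):

SET NOCOUNT ON -- suppress the per-row 'rows affected' messages (server-client chatter)
GO
declare @i int
set @i = 1
BEGIN TRANSACTION -- batch commit: one log flush for the whole run instead of one per INSERT
while @i < 2000 begin
    insert into T2 values (NEWID(), NULL)
    set @i = @i + 1
end
COMMIT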
I expect that in a real database, rebalancing of an index is a minor problem, because many index entries fit in a single block.
What might become more of an issue is contention on the single block that receives all the new entries. Oracle has a feature to store the bytes of the key in reverse order, to spread new entries out over all blocks (see the example below): http://oracletoday.blogspot.com/2006/09/there-is-option-to-create-index.html I don't know about other databases.
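For illustration, a reverse key index in Oracle is declared with the REVERSE keyword (the table and column names here are invented):

-- Oracle: key bytes are stored reversed, so sequential IDs scatter across leaf blocks
CREATE INDEX orders_id_rix ON orders (order_id) REVERSE;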