B-trees, databases, sequential vs. random inserts, and speed. Random is winning


EDIT

@Remus corrected my test pattern. You can see the corrected version in his answer below.

I took the suggestion of replacing the INT with DECIMAL(29,0) and the results were:

Decimal: 2133
GUID: 1836

Random inserts are still winning, even with a fractionally bigger row.

Despite explanations indicating that random inserts are slower than sequential ones, these benchmarks show them to be apparently faster. The explanations I'm getting don't agree with the benchmarks. Therefore, my question remains focused on B-trees, sequential inserts, and speed.

...

I know from experience that B-trees have awful performance when data is added to them sequentially (regardless of the direction). However, when data is added randomly, best performance is obtained.

This is easy to demonstrate with the likes of a red-black tree: sequential writes cause the maximum number of rebalances to be performed.

I know very few databases use binary trees; most use n-order balanced trees instead. I logically assume those suffer a similar fate to binary trees when it comes to sequential input.
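(As an aside, SQL Server can show how shallow and wide such an n-order tree really is. A minimal sketch, assuming the T1 table created below exists; sys.dm_db_index_physical_stats is a standard DMV:)

-- Inspect the shape of the clustered B-tree: DETAILED mode returns
-- one row per level of the tree (index_level 0 = the leaf level).
SELECT index_depth,                  -- total levels from root to leaf
       index_level,                  -- 0 = leaf pages
       page_count,                   -- pages at this level
       avg_page_space_used_percent   -- how full those pages are
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.T1'), NULL, NULL, 'DETAILED');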

This sparked my curiosity.

If this is so, then one could deduce that writing sequential IDs (such as in IDENTITY(1,1)) would cause multiple re-balances of the tree to occur. I have seen many posts argue against GUIDs as "these will cause random writes". I never use GUIDs, but it struck me that this "bad" point was in fact a good point.

So I decided to test it. Here is my code:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[T1](
    [ID] [int] NOT NULL,
    [Padding] [char](300) NULL,   -- assumed padding column; the inserts below supply @c as a second value
 CONSTRAINT [T1_1] PRIMARY KEY CLUSTERED ([ID] ASC) 
)
GO

CREATE TABLE [dbo].[T2](
    [ID] [uniqueidentifier] NOT NULL,
    [Padding] [char](300) NULL,   -- assumed padding column, as in T1
 CONSTRAINT [T2_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)

GO

declare @i int, @t1 datetime, @t2 datetime, @t3 datetime, @c char(300)
set @c = 'x'   -- char(300) pads this to a fixed 300 bytes of row padding

set @t1 = GETDATE()
set @i = 1

while @i < 2000 begin
    insert into T1 values (@i, @c)   -- sequential int key (timed as [Int] below)
    set @i = @i + 1
end

set @t2 = GETDATE()
WAITFOR delay '0:0:10'
set @t3 = GETDATE()
set @i = 1

while @i < 2000 begin
    insert into T2 values (NEWID(), @c)   -- random GUID key (timed as [GUID] below)
    set @i = @i + 1
end

select DATEDIFF(ms, @t1, @t2) AS [Int], DATEDIFF(ms, @t3, getdate()) AS [GUID]

drop table T1
drop table T2

Note that I am not subtracting any time for the creation of the GUID, nor for the considerably larger row size.

Int: 17,340 ms
GUID: 6,746 ms

This means that in this test, random inserts of a 16-byte key were almost 3 times faster than sequential inserts of a 4-byte key.
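For anyone who wants to see what the two insert patterns did to the trees, a minimal sketch (run it before the DROP TABLE statements; sys.dm_db_index_operational_stats is a standard DMV):

-- Leaf-page allocations: for the random GUID key these are mostly
-- mid-index page splits; for the increasing int key, appends at the end.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.leaf_allocation_count,      -- leaf-level page allocations
       s.nonleaf_allocation_count    -- allocations in the upper levels
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS s
WHERE s.object_id IN (OBJECT_ID('dbo.T1'), OBJECT_ID('dbo.T2'));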

Would anyone like to comment on this?

P.S. I get that this isn't a question. It's an invitation to discussion, and that is relevant to learning optimal programming.


如此安好 2024-10-17 07:57:08

Flip the operation and the int is faster. Have you taken into account log and data file growth? Run each test separately:

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 (ID) values (NEWID())   -- Padding column left NULL
    set @i = @i + 1
END


set @t2 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 (ID) values (@i)
    set @i = @i + 1
end



select DATEDIFF(ms, @t1, @t2) AS [UID], DATEDIFF(ms, @t2, getdate()) AS [Int]

The problem with UUIDs, when you cluster on them and do not use NEWSEQUENTIALID(), is that they cause page splits and fragmentation of the table.
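For comparison, a minimal sketch of the NEWSEQUENTIALID() route (it is only usable as a column default), which keeps the GUID key but makes the inserts sequential:

-- Sequential GUIDs: NEWSEQUENTIALID() generates ever-increasing values,
-- so new rows append at the end of the index instead of splitting pages.
CREATE TABLE [dbo].[T3](
    [ID] [uniqueidentifier] NOT NULL CONSTRAINT [T3_DF] DEFAULT NEWSEQUENTIALID(),
 CONSTRAINT [T3_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO

insert into T3 default values   -- ID is filled in by the default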

Now try it like this and you will see it is almost the same:

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 (ID) values (NEWID())
    set @i = @i + 1
END
select DATEDIFF(ms, @t1, getdate()) 

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 (ID) values (@i)
    set @i = @i + 1
end



select DATEDIFF(ms, @t1, getdate())

And reversed

declare @i int, @t1 datetime, @t2 datetime



set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 (ID) values (@i)
    set @i = @i + 1
end
select DATEDIFF(ms, @t1, getdate())   -- int loop, now measured first

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 (ID) values (NEWID())
    set @i = @i + 1
END
select DATEDIFF(ms, @t1, getdate())
长亭外,古道边 2024-10-17 07:57:08

You are not measuring the INSERT speed. You are measuring your log flush performance. Since you commit after each INSERT, all those tests are doing is sitting around waiting for the commit to harden the log. That is hardly relevant for INSERT performance. And please don't post 'performance' measurements taken with SET NOCOUNT OFF...
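If you want to see this for yourself, a sketch: snapshot the WRITELOG wait type from the standard DMV sys.dm_os_wait_stats before and after such a run; with a commit per row it dominates the elapsed time.

-- WRITELOG = time spent waiting for the log flush that hardens a commit.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';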

So let's try this without unnecessary server-client chatter, with properly sized data, batched commits, and a pre-grown database:

-- run in SQLCMD mode (:setvar is a SQLCMD directive)
:setvar dbname testdb
:setvar testsize 1000000
:setvar batchsize 1000

use master;
go

if db_id('$(dbname)') is not null
begin
    drop database [$(dbname)];
end
go

create database [$(dbname)] 
    on (name='test_data', filename='c:\temp\test_data.mdf', size=10gb)
    log on (name='test_log', filename='c:\temp\test_log.ldf', size=100mb);
go

use [$(dbname)];
go  

CREATE TABLE [dbo].[T1](
    [ID] [int] NOT NULL
 CONSTRAINT [T1_1] PRIMARY KEY CLUSTERED ([ID] ASC) 
)
GO

CREATE TABLE [dbo].[T2](
    [ID] [uniqueidentifier] NOT NULL
 CONSTRAINT [T2_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO

set nocount on;
go

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

begin transaction;
while @i < $(testsize) begin
    insert into T1 values (@i)
    set @i = @i + 1
    if @i % $(batchsize) = 0
    begin
        commit;
        begin transaction;
    end
end
commit

set @t2 = GETDATE()
set @i = 1
begin transaction
while @i < $(testsize) begin
    insert into T2 values (NEWID())
    set @i = @i + 1
    if @i % $(batchsize) = 0
    begin
        commit;
        begin transaction;
    end
end
commit

select DATEDIFF(ms, @t1, @t2) AS [Int], DATEDIFF(ms, @t2, getdate()) AS [UID]

drop table T1
drop table T2

INTS: 18s
GUIDS: 23s

QED

羁〃客ぐ 2024-10-17 07:57:08


I expect that in a real database, rebalancing of an index is a minor problem, because lots of index entries fit in a single block.

What might become more of an issue is contention on the single block that receives all the new entries. Oracle has a feature to store the bytes of the key in reverse order, to spread new entries out over all blocks: http://oracletoday.blogspot.com/2006/09/there-is-option-to-create-index.html I don't know about other databases.
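(The feature referred to is Oracle's reverse key index. A one-line sketch, in Oracle syntax rather than SQL Server's, with a hypothetical table t and column id:)

-- Oracle: key bytes are stored reversed, so sequential values scatter
-- across leaf blocks instead of all landing on the last one.
CREATE INDEX t_id_rev ON t (id) REVERSE;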
