PostgreSQL 和SQL Server btree存储基础问题

发布于 2024-10-21 15:20:09 字数 586 浏览 1 评论 0原文

我知道 SQL Server 可以在聚集索引的叶级存储一行数据。我相信 PostgreSQL 不会这样做。如果是这样,它的存储范例是什么?

我的主要问题如下。考虑以下设计和数据(以 T-SQL 所示):

CREATE TABLE dbo.Tree
    (
    [Key] int NOT NULL,
    ID int NOT NULL
    ) ON [PRIMARY]
GO
ALTER TABLE dbo.Tree ADD CONSTRAINT
    PK_Tree PRIMARY KEY CLUSTERED 
    (
    [Key],
    ID
    ) WITH (...) ON [PRIMARY]

INSERT INTO TREE ([Key], ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4).

由于这是一个以两列作为 PK 的 btree,我是否正确地说“[Key] = 1”只会存储一次,而“ID = [1, 2, 3” , 4]" 将是 btree 中的单独值,而每个 sé 中不会有叶值,因为没有不属于 PK 一部分的行列?

这在 PostgreSQL 中如何工作?

I know that SQL Server can store a row's data at leaf level in a clustered index. I believe that PostgreSQL doesn't do this. If so, what is its storage paradigm?

My main question is as follows. Consider the following design & data (shown in T-SQL):

CREATE TABLE dbo.Tree
    (
    [Key] int NOT NULL,
    ID int NOT NULL
    ) ON [PRIMARY]
GO
ALTER TABLE dbo.Tree ADD CONSTRAINT
    PK_Tree PRIMARY KEY CLUSTERED 
    (
    [Key],
    ID
    ) WITH (...) ON [PRIMARY]

INSERT INTO TREE ([Key], ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4).

Since this is a btree with both columns as the PK, am I correct in saying that "[Key] = 1" would only be stored once, and "ID = [1, 2, 3, 4]" would be individual values in the btree, while there would be no leaf values per sé since there are no row columns that aren't part of the PK?

How would this work in PostgreSQL?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

葬﹪忆之殇 2024-10-28 15:20:09

TL;DR 版本 - 无论 DBMS 实现如何,您的键值始终存储在磁盘上。

PostgreSQL 将在磁盘上的页面中存储 4 行,每一行对应您插入的行。 SQL Server 还将在磁盘上存储 4 行。 B树是查找结构,而不是页级存储结构。

在底层磁盘层面,PostgreSQL使用无序的磁盘结构来存储数据。发生这种情况的原因是,由于 MVCC 事务语义,PostgreSQL 可能在任何给定时间维护行的多个副本。每行都有一个 xmin 和 xmax 详细说明当前行的创建和销毁事务 ID。 autovacuum 进程执行幻影记录清理操作。 PostgreSQL 中的索引指向堆表结构中的行。这一组幻灯片详细介绍了该过程。特别是,您需要查看幻灯片 29,了解 B 树查找是如何发生的,并查看幻灯片 48-52,了解数据如何存储在磁盘上的理论讨论。

在 SQL Server 中,叶页上有记录,但只有四行,聚集索引将只有 1 个索引级别 - 叶级别。您可以通过运行 SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Tree'), NULL, NULL, NULL) 来验证这一点。您还可以通过执行以下操作来验证 SQL Server 中的物理页级别:

-- Locate the first page of the index
DBCC IND('test', 'Tree', 1);
GO
-- tell SQL Server to show DBCC output in the message page, not the SQL Server log
DBCC TRACEON (3604);
GO
-- look at nasty, dirty, on page data.
DBCC PAGE(test, 1,155,3);

一旦您查看了 DBCC PAGE 输出,您就会恨我了。最后,您应该看到四行,如下所示:

Slot 0 Offset 0x60 Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP     Record Size = 15

Memory Dump @0x000000006D6FA060

0000000000000000:   10000c00 01000000 01000000 020000††††...............  

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

Key = 1                              

Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4

ID = 1                               

Slot 0 Offset 0x0 Length 0 Length (physical) 0

KeyHashValue = (e2338e2f4a9f)  

这是 SQL Server 存储的实际行数据。您将在整个输出中看到 Key = 1 的多个副本,后跟 ID 信息。可以找到这些命令的支持信息 此处

PostgreSQL 和 SQL Server 之间差异背后的原因来自 PostgreSQL 的 MVCC 实现。由于我们可能在 PostgreSQL 中拥有一行的多个副本,因此最好在磁盘上保留多个数据副本,而不是修改支持的索引结构。只要有可能,PostgreSQL 就会进行仅堆更新,并且只对基础​​表发出更新。 SQL Server 做同样的事情,并且只有在可以避免更新支持索引时才会更新聚集索引(或堆)。

TL;DR version - your key values are always stored on disk, regardless of DBMS implementation.

PostgreSQL would store 4 rows in page on disk, one for each row that you've inserted. SQL Server will also store 4 rows on disk. The B-tree is the lookup structure, not the page level storage structure.

At the underlying disk level, PostgreSQL uses unordered disk structures to store data. This happens because PostgreSQL may be maintaining multiple copies of a row at any given time due to MVCC transaction semantics. Each row has an xmin and xmax detailing the creation and destruction transaction ID of the current row. The autovacuum process performs ghost record clean up operations. The indexes in PostgreSQL point back to the rows in the heap table structure. This set of slides details the process. In particular you'll want to look at slide 29 for how the b-tree lookup occurs and 48-52 for a theoretical discussion of how the data is stored on disk.

In SQL Server, you'll have records on a leaf page, but with only four rows the clustered index will have just 1 index level - the leaf level. You can verify this by running SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Tree'), NULL, NULL, NULL). You can also verify the physical page level in SQL Server doing something like this:

-- Locate the first page of the index
DBCC IND('test', 'Tree', 1);
GO
-- tell SQL Server to show DBCC output in the message page, not the SQL Server log
DBCC TRACEON (3604);
GO
-- look at nasty, dirty, on page data.
DBCC PAGE(test, 1,155,3);

Once you look at the DBCC PAGE output, you'll be ready to hate me. Towards the end you should see four rows that look something like this:

Slot 0 Offset 0x60 Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP     Record Size = 15

Memory Dump @0x000000006D6FA060

0000000000000000:   10000c00 01000000 01000000 020000††††...............  

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

Key = 1                              

Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4

ID = 1                               

Slot 0 Offset 0x0 Length 0 Length (physical) 0

KeyHashValue = (e2338e2f4a9f)  

This is the actual row data as SQL Server is storing it. You'll see multiple copies of Key = 1 throughout the output, followed by the ID information. Supporting information for these commands can be found here.

The reasoning behind the difference between PostgreSQL and SQL Server comes from PostgreSQL's MVCC implementation. Since we may have multiple copies of a row in PostgreSQL, it's more optimal to keep several copies of the data on disk instead of modifying the supporting index structures. Whenever possible, PostgreSQL does heap-only updates and only issues updates on the underlying table. SQL Server does the same thing and will only update the clustered index (or heap) when it can avoid updating the supporting indexes.

睡美人的小仙女 2024-10-28 15:20:09

你是对的 - Postgres 不能做你所要求的事情。有关详细信息,请参阅此问题

您可以使用 CLUSTER 实现行的聚类 命令,但是一旦执行 DML,这不会使数据保持集群状态。

You are right - Postgres cannot do what you are asking. see this question for details.

You can achieve clustering of rows using the CLUSTER command, but this does not keep the data clustered once you do DML.

凑诗 2024-10-28 15:20:09

我知道 SQL Server 可以在聚集索引的叶级存储一行数据。我相信 PostgreSQL 不会这样做。如果是这样,它的存储范例是什么?

与 SQL Server 和其他引擎不同,PostgreSQL 不会将更改记录的事务 ID 存储在索引中,而仅存储在堆中。

索引只是指向堆(并将相应记录的 ctid 存储为行指针,因此也是键的一部分)。

这意味着对于每个查询,即使可以通过索引查找来满足,仍然应该进行堆查找以确保数据对当前事务的可见性。

因此,覆盖索引在 PostgreSQL 中并没有那么有用:因为无论如何都应该进行堆查找,引擎只能从堆中获取所有数据。

I know that SQL Server can store a row's data at leaf level in a clustered index. I believe that PostgreSQL doesn't do this. If so, what is its storage paradigm?

Unlike SQL Server and other engines, PostgreSQL does not store the id of the transaction that changed the record in the indexes, only in the heap.

The indexes just point into the heap (and store the ctid of the appropriate record as a row pointer and hence a part of the key).

This means that for each query, even if it could be satisfied by an index lookup, a heap lookup should still be made to ensure the visibility of the data to the current transaction.

Thus said, covering indexes are not that useful in PostgreSQL: since heap lookups should be made anyway, the engine can just take all the data from the heap.

沉默的熊 2024-10-28 15:20:09

如果您想了解有关索引的更多信息,请查看我的 SQL 索引教程

Have a look at my SQL Indexing tutorial if you want to learn more about indexing in general.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文