PostgreSQL 和SQL Server btree存储基础问题
我知道 SQL Server 可以在聚集索引的叶级存储一行数据。我相信 PostgreSQL 不会这样做。如果是这样,它的存储范例是什么?
我的主要问题如下。考虑以下设计和数据(以 T-SQL 所示):
CREATE TABLE dbo.Tree
(
[Key] int NOT NULL,
ID int NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE dbo.Tree ADD CONSTRAINT
PK_Tree PRIMARY KEY CLUSTERED
(
[Key],
ID
) WITH (...) ON [PRIMARY]
INSERT INTO TREE ([Key], ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4).
由于这是一个以两列作为 PK 的 btree,我是否正确地说“[Key] = 1”只会存储一次,而“ID = [1, 2, 3” , 4]" 将是 btree 中的单独值,而每个 sé 中不会有叶值,因为没有不属于 PK 一部分的行列?
这在 PostgreSQL 中如何工作?
I know that SQL Server can store a row's data at leaf level in a clustered index. I believe that PostgreSQL doesn't do this. If so, what is its storage paradigm?
My main question is as follows. Consider the following design & data (shown in T-SQL):
CREATE TABLE dbo.Tree
(
[Key] int NOT NULL,
ID int NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE dbo.Tree ADD CONSTRAINT
PK_Tree PRIMARY KEY CLUSTERED
(
[Key],
ID
) WITH (...) ON [PRIMARY]
INSERT INTO TREE ([Key], ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4).
Since this is a btree with both columns as the PK, am I correct in saying that "[Key] = 1" would only be stored once, and "ID = [1, 2, 3, 4]" would be individual values in the btree, while there would be no leaf values per sé since there are no row columns that aren't part of the PK?
How would this work in PostgreSQL?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
TL;DR 版本 - 无论 DBMS 实现如何,您的键值始终存储在磁盘上。
PostgreSQL 将在磁盘上的页面中存储 4 行,每一行对应您插入的行。 SQL Server 还将在磁盘上存储 4 行。 B树是查找结构,而不是页级存储结构。
在底层磁盘层面,PostgreSQL使用无序的磁盘结构来存储数据。发生这种情况的原因是,由于 MVCC 事务语义,PostgreSQL 可能在任何给定时间维护行的多个副本。每行都有一个 xmin 和 xmax 详细说明当前行的创建和销毁事务 ID。 autovacuum 进程执行幻影记录清理操作。 PostgreSQL 中的索引指向堆表结构中的行。这一组幻灯片详细介绍了该过程。特别是,您需要查看幻灯片 29,了解 B 树查找是如何发生的,并查看幻灯片 48-52,了解数据如何存储在磁盘上的理论讨论。
在 SQL Server 中,叶页上有记录,但只有四行,聚集索引将只有 1 个索引级别 - 叶级别。您可以通过运行 SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Tree'), NULL, NULL, NULL) 来验证这一点。您还可以通过执行以下操作来验证 SQL Server 中的物理页级别:
一旦您查看了 DBCC PAGE 输出,您就会恨我了。最后,您应该看到四行,如下所示:
这是 SQL Server 存储的实际行数据。您将在整个输出中看到 Key = 1 的多个副本,后跟 ID 信息。可以找到这些命令的支持信息 此处。
PostgreSQL 和 SQL Server 之间差异背后的原因来自 PostgreSQL 的 MVCC 实现。由于我们可能在 PostgreSQL 中拥有一行的多个副本,因此最好在磁盘上保留多个数据副本,而不是修改支持的索引结构。只要有可能,PostgreSQL 就会进行仅堆更新,并且只对基础表发出更新。 SQL Server 做同样的事情,并且只有在可以避免更新支持索引时才会更新聚集索引(或堆)。
TL;DR version - your key values are always stored on disk, regardless of DBMS implementation.
PostgreSQL would store 4 rows in page on disk, one for each row that you've inserted. SQL Server will also store 4 rows on disk. The B-tree is the lookup structure, not the page level storage structure.
At the underlying disk level, PostgreSQL uses unordered disk structures to store data. This happens because PostgreSQL may be maintaining multiple copies of a row at any given time due to MVCC transaction semantics. Each row has an xmin and xmax detailing the creation and destruction transaction ID of the current row. The autovacuum process performs ghost record clean up operations. The indexes in PostgreSQL point back to the rows in the heap table structure. This set of slides details the process. In particular you'll want to look at slide 29 for how the b-tree lookup occurs and 48-52 for a theoretical discussion of how the data is stored on disk.
In SQL Server, you'll have records on a leaf page, but with only four rows the clustered index will have just 1 index level - the leaf level. You can verify this by running
SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Tree'), NULL, NULL, NULL)
. You can also verify the physical page level in SQL Server doing something like this:Once you look at the DBCC PAGE output, you'll be ready to hate me. Towards the end you should see four rows that look something like this:
This is the actual row data as SQL Server is storing it. You'll see multiple copies of Key = 1 throughout the output, followed by the ID information. Supporting information for these commands can be found here.
The reasoning behind the difference between PostgreSQL and SQL Server comes from PostgreSQL's MVCC implementation. Since we may have multiple copies of a row in PostgreSQL, it's more optimal to keep several copies of the data on disk instead of modifying the supporting index structures. Whenever possible, PostgreSQL does heap-only updates and only issues updates on the underlying table. SQL Server does the same thing and will only update the clustered index (or heap) when it can avoid updating the supporting indexes.
你是对的 - Postgres 不能做你所要求的事情。有关详细信息,请参阅此问题。
您可以使用
CLUSTER
实现行的聚类 命令,但是一旦执行 DML,这不会使数据保持集群状态。You are right - Postgres cannot do what you are asking. see this question for details.
You can achieve clustering of rows using the
CLUSTER
command, but this does not keep the data clustered once you do DML.与 SQL Server 和其他引擎不同,PostgreSQL 不会将更改记录的事务 ID 存储在索引中,而仅存储在堆中。
索引只是指向堆(并将相应记录的 ctid 存储为行指针,因此也是键的一部分)。
这意味着对于每个查询,即使可以通过索引查找来满足,仍然应该进行堆查找以确保数据对当前事务的可见性。
因此,覆盖索引在 PostgreSQL 中并没有那么有用:因为无论如何都应该进行堆查找,引擎只能从堆中获取所有数据。
Unlike
SQL Server
and other engines,PostgreSQL
does not store the id of the transaction that changed the record in the indexes, only in the heap.The indexes just point into the heap (and store the
ctid
of the appropriate record as a row pointer and hence a part of the key).This means that for each query, even if it could be satisfied by an index lookup, a heap lookup should still be made to ensure the visibility of the data to the current transaction.
Thus said, covering indexes are not that useful in
PostgreSQL
: since heap lookups should be made anyway, the engine can just take all the data from the heap.如果您想了解有关索引的更多信息,请查看我的 SQL 索引教程。
Have a look at my SQL Indexing tutorial if you want to learn more about indexing in general.