当前位置：文江博客话题详情

innodb b 树中的内部节点是如何物理存储的？

发布于 2024-10-05 05:27:04 字数 230 浏览 8 评论 0原文

非叶 B 树节点在 innodb 中是如何物理表示的？

回想一下，b 树（更具体地说是 b+树）具有叶节点和非叶节点。在 b+tree 中，所有叶节点都位于非叶节点或“内部”节点的树下方，并指向实际包含行数据的页面。

我知道非叶节点存储在非叶节点段中，并使用类似于数据页的页面。我找到了有关数据页如何物理存储的充足文档，但我无法找到有关非叶索引页的任何内容。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

像极了他 2024-10-12 05:27:04

在《学习 InnoDB：核心之旅》中，我介绍了 innodb_diagrams 项目来记录 InnoDB 内部结构，它提供了本文中使用的图表。稍后在 innodb_ruby 快速介绍中，我逐步完成了 innodb_space 命令行工具的安装和一些快速演示。

InnoDB索引页的物理结构在InnoDB索引页的物理结构中描述。现在我们将使用一些实际示例来研究 InnoDB 如何在逻辑上构建其索引。

术语旁白：B+Tree、root、leaf 和 level
InnoDB 使用 B+Tree 结构作为索引。当数据无法放入内存并且必须从磁盘读取时，B+Tree 特别高效，因为它确保仅基于树的深度来访问任何请求的数据需要固定的最大读取次数，可以很好地扩展。

索引树从“根”页开始，其位置是固定的（并永久存储在 InnoDB 的数据字典中）作为访问树的起点。该树可以小到单个根页面，也可以大到多级树中的数百万个页面。

页面被称为“叶”页面或“非叶”页面（在某些上下文中也称为“内部”或“节点”页面）。叶页包含实际的行数据。非叶页仅包含指向其他非叶页或叶页的指针。该树是平衡的，因此树的所有分支都具有相同的深度。

InnoDB 为树中的每个页面分配一个“级别”：叶页面被分配为级别 0，并且级别沿着树向上递增。根页面级别基于树的深度。如果区分很重要的话，所有既不是叶页面也不是根页面的页面也可以称为“内部”页面。

叶页和非叶页
对于叶页和非叶页，每个记录（包括下确界和上界系统记录）都包含一个“下一条记录”指针，该指针存储到下一条记录的偏移量（页内）。链表从下确界开始，按键按升序链接所有记录，在上界终止。记录在页面内的物理排序并不正确（它们占用插入时可用的任何空间）；它们唯一的顺序来自于它们在链表中的位置。

叶子页面包含非键值作为每条记录中包含的“数据”的一部分：

非叶子页面具有相同的结构，但它们的“数据”不是非关键字段，而是子页面的页码，而不是确切的键，它们表示它们指向的子页面上的最小键：

同一级别的页面
大多数索引包含多个页面，因此多个页面按升序和降序链接在一起：

每个页面都包含“上一页”和“下一页”的指针（在 FIL 标头中），其中对于 INDEX 页面，用于形成同一级别页面的双向链接列表（例如叶页面，在级别 0 形成一个列表，级别 1 页面形成单独的列表等）。

单页表格的详细查看
让我们看一下单个索引页面中与 B+Tree 相关的大部分内容：

创建并填充表格
可以创建并填充上图中使用的测试表（确保您使用的是 innodb_file_per_table 并使用 Barracuda 文件格式）：

 CREATE TABLE t_btree (
  i INT NOT NULL,
  s CHAR(10) NOT NULL,
  PRIMARY KEY(i)
) ENGINE=InnoDB;

INSERT INTO t_btree (i, s)
  VALUES (0, "A"), (1, "B"), (2, "C");

虽然该表非常小且不现实，但它确实很好地演示了记录和记录遍历的工作原理。

验证空间文件的基本结构
该表应该与我们之前检查过的内容相匹配，具有三个标准开销页（FSP_HDR、IBUF_BITMAP 和 INODE），后跟一个用于索引根的 INDEX 页，在本例中是两个未使用的 ALLOCATED 页。

$ innodb_space -f t_btree.ibd 空间页面类型区域

start       end         count       type                
0           0           1           FSP_HDR             
1           1           1           IBUF_BITMAP         
2           2           1           INODE               
3           3           1           INDEX               
4           5           2           FREE (ALLOCATED)

space-index-pages-summary 模式将为我们提供每个页面中的记录计数，并显示预期的 3 条记录：

$ innodb_space -f t_btree.ibd 空间索引页面摘要

page        index   level   data    free    records 
3           18      0       96      16156   3       
4           0       0       0       16384   0       
5           0       0       0       16384   0

（请注意，space-index-pages-summary 还将空的 ALLOCATED 页显示为具有零记录的空页，因为这通常是您出于绘图目的感兴趣的内容。）

空间索引模式将显示有关我们的主键索引的统计信息，该索引正在其内部文件段上消耗单个页面：

$ innodb_space -f t_btree.ibd 空间索引

id          root        fseg        used        allocated   fill_factor 
18          3           internal    1           1           100.00%     
18          3           leaf        0           0           0.00%

设置记录描述器
为了让 innodb_ruby 解析记录的内容，我们需要提供一个记录描述器，它只是一个提供返回索引描述的方法的 Ruby 类

： Innodb::RecordDescriber
类型：簇状
键“i”，：INT，：NOT_NULL
行“s”，“CHAR（10）”，：NOT_NULL
我们需要注意

这是聚集键，提供键的列描述以及非键（“行”）字段的列描述。有必要要求 innodb_space 使用以下附加参数加载此类：

-r -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber

查看记录内容
本例中的根页面（叶页面）可以使用页面转储模式进行转储并提供根页面的页码：

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 页面转储

除了我们之前看过的输出的某些部分之外，它现在还会打印一个“records:”部分，其中包含以下内容每条记录的结构：

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>157,
   :type=>:conventional,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>false,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0, 0, 0, 0],
   :field_externs=>[false, false, false, false]},
 :next=>157,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT", :value=>0, :extern=>nil}],
 :transaction_id=>"0000000f4745",
 :roll_pointer=>
  {:is_insert=>true, :rseg_id=>8, :undo_log=>{:page=>312, :offset=>272}},
 :row=>[{:name=>"s", :type=>"CHAR(10)", :value=>"A", :extern=>nil}]}

这应该与上面的详细说明完全一致，因为为了准确性，我已经复制了此示例中的大部分信息。请注意以下方面：

:format 为 :compact 表示该记录是 Barracuda 格式表中新的“紧凑”格式（与 Antelope 表中的“冗余”格式相对）。
输出中列出的 :key 是索引的关键字段数组，:row 是非关键字段数组。
:transaction_id 和 :roll_pointer 字段是每个记录中包含的 MVCC 的内部字段，因为这是一个聚集键（主键）。
:header 哈希中的 :next 字段是一个相对偏移量 (32)，必须将其添加到当前记录偏移量 (125) 才能生成下一条记录的实际偏移量 (157)。为了方便起见，计算出的偏移量作为 :next 包含在记录哈希中。
递归索引
使用索引递归模式可以实现递归整个索引的良好且简单的输出，但由于这仍然是单页索引，因此输出将非常短：

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 索引递归

ROOT NODE #3: 3 records, 96 bytes
RECORD: (i=0) -> (s=A)
RECORD: (i=1) -> (s=B)
RECORD: (i=2) -> (s=C)

构建一个不平凡的索引树
InnoDB 中的多级索引树（过于简化）如下所示：

如前所述，每个级别的所有页面都相互双向链接，并且在每个页面内，记录都是单向链接按升序排列。非叶页包含“指针”（包含子页号）而不是非键行数据。

如果我们使用 innodb_ruby 快速介绍中创建的包含 100 万行的更简单的表模式，则树结构看起来更有趣：

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 3 索引递归

ROOT NODE #3: 2 records, 26 bytes
NODE POINTER RECORD >= (i=252) -> #36
INTERNAL NODE #36: 1117 records, 14521 bytes
NODE POINTER RECORD >= (i=252) -> #4
LEAF NODE #4: 446 records, 9812 bytes
RECORD: (i=1) -> ()
RECORD: (i=2) -> ()
RECORD: (i=3) -> ()
RECORD: (i=4) -> ()

NODE POINTER RECORD >= (i=447) -> #1676
LEAF NODE #1676: 444 records, 9768 bytes
RECORD: (i=447) -> ()
RECORD: (i=448) -> ()
RECORD: (i=449) -> ()
RECORD: (i=450) -> ()

NODE POINTER RECORD >= (i=891) -> #771
LEAF NODE #771: 512 records, 11264 bytes
RECORD: (i=891) -> ()
RECORD: (i=892) -> ()
RECORD: (i=893) -> ()
RECORD: (i=894) -> ()

这是一个三级索引树，通过上面的ROOT、INTERNAL、LEAF这几行可以看出。我们可以看到有些页面完全满了，468 条记录消耗了 16 KiB 页面中的近 15 KiB。

使用页面转储模式查看非叶页（第 36 页，在上面的输出中），记录看起来与之前显示的叶页略有不同：

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 36 页面转储

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>11877,
   :type=>:node_pointer,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>true,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0],
   :field_externs=>[false]},
 :next=>11877,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT UNSIGNED", :value=>252, :extern=>nil}],
 :child_page_number=>4}

存在 :key 数组，尽管它代表最小键而不是精确键，并且不存在 :row，因为 :child_page_number 取代了它的位置。

根页面有点特殊
由于根页是在第一次创建索引时分配的，并且该页号存储在数据字典中，因此根页永远无法重新定位或删除。一旦根页面填满，就需要对其进行拆分，形成一个根页面加两个叶子页面的小树。

然而，根页面本身实际上不能被分割，因为它不能被重新定位。相反，会分配一个新的空页，将根中的记录移动到那里（根“提升”了一个级别），并且该新页被分成两部分。然后，根页面不需要再次分割，直到紧接其下的级别具有足够的页面，使得根充满子页面指针（称为“节点指针”），这在实践中通常意味着数百到一千多个。

B+树级别和增加树深度
作为 B+Tree 索引效率的一个例子，假设完美的记录打包（每个页面都已满，这在实践中永远不会发生，但对于讨论很有用）。上例中简单表的 InnoDB 中的 B+Tree 索引将能够在每个叶页存储 468 条记录，或每个非叶页存储 1203 条记录。在给定的树高度下，索引树的最大大小可以是以下大小：

Height  Non-leaf pages  Leaf pages  Rows    Size in bytes
1   0   1   468 16.0 KiB
2   1   1203    > 563 thousand  18.8 MiB
3   1204    1447209 > 677 million   22.1 GiB
4   1448413 1740992427  > 814 billion   25.9 TiB

正如您可以想象的那样，大多数具有合理 PRIMARY KEY 定义的表都是 2-3 级，有些达到 4 级。然而，使用过大的 PRIMARY KEY 可能会导致 B+Tree 的效率大大降低，因为主键值必须存储在非叶页中。这会极大地增加非叶页中记录的大小，这意味着每个非叶页中适合的记录要少得多，从而降低整个结构的效率。

In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.

The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages. We’ll now look into how InnoDB logically structures its indexes, using some practical examples.

An aside on terminology: B+Tree, root, leaf, and level
InnoDB uses a B+Tree structure for its indexes. A B+Tree is particularly efficient when data doesn’t fit in memory and must be read from the disk, as it ensures that a fixed maximum number of reads would be required to access any data requested, based only on the depth of the tree, which scales nicely.

An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.

Pages are referred to as being “leaf” pages or “non-leaf” pages (also called “internal” or “node” pages in some contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.

InnoDB assigns each page in the tree a “level”: leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.

Leaf and non-leaf pages
For both leaf and non-leaf pages, each record (including the infimum and supremum system records) contain a “next record” pointer, which stores an offset (within the page) to the next record. The linked list starts at infimum and links all records in ascending order by key, terminating at supremum. The records are not physically ordered within the page (they take whatever space is available at the time of insertion); their only order comes from their position in the linked list.

Leaf pages contain the non-key values as part of the “data” contained in each record:

Non-leaf pages have an identical structure, but instead of non-key fields, their “data” is the page number of the child page, and instead of an exact key, they represent the minimum key on the child page they point to:

Pages at the same level
Most indexes contain more than one page, so multiple pages are linked together in ascending and descending order:

Each page contains pointers (in the FIL header) for “previous page” and “next page”, which for INDEX pages are used to form a doubly-linked list of pages at the same level (e.g. leaf pages, at level 0 form one list, level 1 pages form a separate list, etc.).

A detailed look at a single-page table
Let’s take a look at most of what’s B+Tree related in a single index page:

Create and populate the table
The test table in use in the illustration above can be created and populated with (make sure you’re using innodb_file_per_table and using Barracuda file format):

 CREATE TABLE t_btree (
  i INT NOT NULL,
  s CHAR(10) NOT NULL,
  PRIMARY KEY(i)
) ENGINE=InnoDB;

INSERT INTO t_btree (i, s)
  VALUES (0, "A"), (1, "B"), (2, "C");

While this table is quite small and not realistic, it does demonstrate nicely how records and record traversal works.

Verify the basic structure of the space file
The table should match what we’ve examined before, with the three standard overhead pages (FSP_HDR, IBUF_BITMAP, and INODE) followed by a single INDEX page for the root of the index, and in this case two unused ALLOCATED pages.

$ innodb_space -f t_btree.ibd space-page-type-regions

start       end         count       type                
0           0           1           FSP_HDR             
1           1           1           IBUF_BITMAP         
2           2           1           INODE               
3           3           1           INDEX               
4           5           2           FREE (ALLOCATED)

The space-index-pages-summary mode will give us a count of records in each page, and is showing the expected 3 records:

$ innodb_space -f t_btree.ibd space-index-pages-summary

page        index   level   data    free    records 
3           18      0       96      16156   3       
4           0       0       0       16384   0       
5           0       0       0       16384   0

(Note that space-index-pages-summary also shows the empty ALLOCATED pages as empty pages with zero records, since that’s often what you’re interested in for plotting purposes.)

The space-indexes mode will show the stats about our PRIMARY KEY index, which is consuming a single page on its internal file segment:

$ innodb_space -f t_btree.ibd space-indexes

id          root        fseg        used        allocated   fill_factor 
18          3           internal    1           1           100.00%     
18          3           leaf        0           0           0.00%

Set up a record describer
In order for innodb_ruby to parse the contents of records, we need to provide a record describer, which is just a Ruby class providing a method that returns a description of an index:

class SimpleTBTreeDescriber < Innodb::RecordDescriber
type :clustered
key "i", :INT, :NOT_NULL
row "s", "CHAR(10)", :NOT_NULL
end

We need to note that this is the clustered key, provide the column descriptions for the key, and the column descriptions for the non-key (“row”) fields. It’s necessary to ask innodb_space to load this class with the following additional arguments:

-r -r ./simple_t_btree_describer.rb -d SimpleTBTreeDescriber

Look at the record contents
The root page (which is a leaf page) in this example can be dumped using the page-dump mode and providing the page number for the root page:

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 page-dump

Aside from some parts of this output we’ve looked at before, it will now print a “records:” section with the following structure per record:

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>157,
   :type=>:conventional,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>false,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0, 0, 0, 0],
   :field_externs=>[false, false, false, false]},
 :next=>157,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT", :value=>0, :extern=>nil}],
 :transaction_id=>"0000000f4745",
 :roll_pointer=>
  {:is_insert=>true, :rseg_id=>8, :undo_log=>{:page=>312, :offset=>272}},
 :row=>[{:name=>"s", :type=>"CHAR(10)", :value=>"A", :extern=>nil}]}

This should align with the above detailed illustration perfectly, as I’ve copied most of the information from this example for accuracy. Note the following aspects:

The :format being :compact indicates that the record is the new “compact” format in Barracuda format tables (as opposed to “redundant” in Antelope tables).
The :key listed in the output is an array of key fields for the index, and :row is an array of non-key fields.
The :transaction_id and :roll_pointer fields are internal fields for MVCC included in each record, since this is a clustered key (the PRIMARY KEY).
The :next field within the :header hash is a relative offset (32) which must be added to the current record offset (125) to yield the actual offset of the next record (157). For convenience this calculated offset is included as :next in the record hash.
Recurse the index
A nice and simple output of recursing the entire index can be achieved with the index-recurse mode, but since this is still a single-page index, the output will be very short:

$ innodb_space -f t_btree.ibd -r ./simple_t_btree_describer.rb -d
SimpleTBTreeDescriber -p 3 index-recurse

ROOT NODE #3: 3 records, 96 bytes
RECORD: (i=0) -> (s=A)
RECORD: (i=1) -> (s=B)
RECORD: (i=2) -> (s=C)

Building a non-trivial index tree
A multi-level index tree (overly simplified) in InnoDB looks like:

As previously described, all pages at each level are doubly-linked to each other, and within each page, records are singly-linked in ascending order. Non-leaf pages contain “pointers” (containing the child page number) rather than non-key row data.

If we use the simpler table schema with 1 million rows created in A quick introduction to innodb_ruby, the tree structure looks a little more interesting:

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 3 index-recurse

ROOT NODE #3: 2 records, 26 bytes
NODE POINTER RECORD >= (i=252) -> #36
INTERNAL NODE #36: 1117 records, 14521 bytes
NODE POINTER RECORD >= (i=252) -> #4
LEAF NODE #4: 446 records, 9812 bytes
RECORD: (i=1) -> ()
RECORD: (i=2) -> ()
RECORD: (i=3) -> ()
RECORD: (i=4) -> ()

NODE POINTER RECORD >= (i=447) -> #1676
LEAF NODE #1676: 444 records, 9768 bytes
RECORD: (i=447) -> ()
RECORD: (i=448) -> ()
RECORD: (i=449) -> ()
RECORD: (i=450) -> ()

NODE POINTER RECORD >= (i=891) -> #771
LEAF NODE #771: 512 records, 11264 bytes
RECORD: (i=891) -> ()
RECORD: (i=892) -> ()
RECORD: (i=893) -> ()
RECORD: (i=894) -> ()

This is a three-level index tree, which can be seen by the ROOT, INTERNAL, LEAF lines above. We can see that some pages are completely full, with 468 records consuming almost 15 KiB of the 16 KiB page.

Looking at a non-leaf page (page 36, in the above output) using the page-dump mode, records look slightly different than the leaf pages shown previously:

$ innodb_space -f t.ibd -r ./simple_t_describer.rb -d SimpleTDescriber
-p 36 page-dump

{:format=>:compact,
 :offset=>125,
 :header=>
  {:next=>11877,
   :type=>:node_pointer,
   :heap_number=>2,
   :n_owned=>0,
   :min_rec=>true,
   :deleted=>false,
   :field_nulls=>nil,
   :field_lengths=>[0],
   :field_externs=>[false]},
 :next=>11877,
 :type=>:clustered,
 :key=>[{:name=>"i", :type=>"INT UNSIGNED", :value=>252, :extern=>nil}],
 :child_page_number=>4}

The :key array is present, although it represents the minimum key rather than an exact key, and no :row is present, as a :child_page_number takes its place.

The root page is a bit special
Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.

However, the root page itself can’t actually be split, since it cannot be relocated. Instead, a new, empty page is allocated, the records in the root are moved there (the root is “raised” a level), and that new page is split into two. The root page then does not need to be split again until the level immediately below it has enough pages that the root becomes full of child page pointers (called “node pointers”), which in practice often means several hundred to more than a thousand.

B+Tree levels and increasing tree depth
As an example of the efficiency of B+Tree indexes, assume perfect record packing (every page full, which will never quite happen in practice, but is useful for discussion). A B+Tree index in InnoDB for the simple table in the examples above will be able to store 468 records per leaf page, or 1203 records per non-leaf page. The index tree can then be a maximum of the following sizes at the given tree heights:

Height  Non-leaf pages  Leaf pages  Rows    Size in bytes
1   0   1   468 16.0 KiB
2   1   1203    > 563 thousand  18.8 MiB
3   1204    1447209 > 677 million   22.1 GiB
4   1448413 1740992427  > 814 billion   25.9 TiB

As you can imagine, most tables with sensible PRIMARY KEY definitions are 2-3 levels, with some achieving 4 levels. Using an excessively large PRIMARY KEY can cause the B+Tree to be much less efficient, however, since primary key values must be stored in the non-leaf pages. This can drastically inflate the size of the records in non-leaf pages, meaning many fewer of those records fit in each non-leaf page, making the whole structure less efficient.

回复收藏 0 原文

~没有更多了~