在大型数据库中为简单查询建立索引

发布于 2024-10-14 23:42:21 字数 861 浏览 2 评论 0原文

我得到一个包含近 850,000,000 行的表。

该表具有以下字段：

[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,

我对此表的所有查询都完全相同 -

选择 [D1-D9], [A] 其中 [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z，

等...

每个查询将始终查询所有 [D1-D9] 字段并始终要求 [Hb] = 0

查询示例：

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2 
   AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

我应该如何为此表建立索引最快的结果？

多谢

原文

I got a table contains nearly 850,000,000 rows.

The table has the following fields:

[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,

All my queries for this table are quite the same -

Select [D1-D9], [A] Where [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z,

etc....

Each query will ALWAYS query ALL [D1-D9] fields and always ask for [Hb] = 0

Example for a query:

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2 
   AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

How should I index this table for the fastest results?

Thanks a lot

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

一笑百媚生 2024-10-21 23:42:21

您能做的最好的事情就是让索引首先进行相等性检查，然后进行剩余的不相等性查找。即，= 位于 <> 之前。

重新排列 WHERE 子句：

WHERE
--Equality
D4 = 8 AND D5 = 2 AND D6 = 5 AND D7 = 5 AND D8 = 3 AND D9 = 4 AND A = 0 
--in the middle    
AND Hb = 0
--Non-Equality
D1 <> 8 AND D2 <> 2 AND D3 <> 5

因此，初稿是这样的：

CREATE .. INDEX ... ON (D4, D5, D6, D7, D8, D9, A, Hb, D1, D2, D3)

D4 到 D9 的顺序应基于选择性。数字最大的优先。 Hb 应该始终在等式列中排在最后，因为它是位

SELECT
   COUNT(DISTINCT D4) AS D4COunt,
   COUNT(DISTINCT D5) AS D5COunt,
   COUNT(DISTINCT D6) AS D6COunt,
   COUNT(DISTINCT D7) AS D7COunt,
   COUNT(DISTINCT D8) AS D8COunt,
   COUNT(DISTINCT D9) AS D9COunt,
   COUNT(DISTINCT A) AS ACOunt
FROM
    Mytable

最后，这可以是集群或非集群。如果您没有其他索引或没有 FK，那么我会考虑将其设为集群 PK。否则，只需创建一个聚集代理键并使该索引为非聚集

编辑：

一篇文章（希望）解释了为什么列顺序对于多列索引很重要：Craig Freedman 的搜索谓词。还有他的扫描和搜索

编辑2：

我询问 <> 之前的 = 是否在同一列上：显示“是”。
OP对此答案的评论说“不”，所以我在这里所说的一切都是毫无意义的。Damien_The_Unbelier

建议的索引交叉点的答案是尝试绕过这个平等/非平等的混合。

The best you can do is have your index do equality checks first followed by a residual non-equality lookup. That is, = before <>.

Rearranging the WHERE clause:

WHERE
--Equality
D4 = 8 AND D5 = 2 AND D6 = 5 AND D7 = 5 AND D8 = 3 AND D9 = 4 AND A = 0 
--in the middle    
AND Hb = 0
--Non-Equality
D1 <> 8 AND D2 <> 2 AND D3 <> 5

So, first draft is this:

CREATE .. INDEX ... ON (D4, D5, D6, D7, D8, D9, A, Hb, D1, D2, D3)

The order of D4 to D9 should be based on selectivity. Highest numbers first. Hb should always go last in the equality columns because it's bit

SELECT
   COUNT(DISTINCT D4) AS D4COunt,
   COUNT(DISTINCT D5) AS D5COunt,
   COUNT(DISTINCT D6) AS D6COunt,
   COUNT(DISTINCT D7) AS D7COunt,
   COUNT(DISTINCT D8) AS D8COunt,
   COUNT(DISTINCT D9) AS D9COunt,
   COUNT(DISTINCT A) AS ACOunt
FROM
    Mytable

Finally, this can be clustered or non-clustered. If you have no other indexes or no FKs then I'd consider make this the clustered PK. Otherwise, just create a clustered surrogate key and make this index NONCLUSTERED

Edit:

An article that (hopefully) explains why column order matters for mulitple column indexes: Craig Freedman's Seek Predicates. And his Scans and Seeks too

Edit2:

I asked if the = before <> are on the same columns: it appeared "yes".
OP's comment to this answer says "no" so everything I've said here is pointless

The answer from Damien_The_Unbeliever's suggested index intersections to try and get around this the equality/nonequality mix.

回复收藏 0 原文

┊风居住的梦幻卍 2024-10-21 23:42:21

您可能会发现（如果每个查询中的十列的各个相等/不等式测试不同），您能做的最好的事情就是在每个列上单独构建一个窄索引，然后希望优化器将应用索引交集，它将在其中在有意义的地方使用每列上的索引。

回复收藏 0 原文

朮生 2024-10-21 23:42:21

扩展@gbn的答案。

对于这种大小的表，您肯定需要一个覆盖所有选定列的索引。

但是，对于每一列，您应该决定是否希望它成为索引中的键列或包含列。

为此，请运行以下查询：

SELECT  SUM(CASE D1 WHEN 8 THEN 0 ELSE 1 END) / COUNT(*) AS D1Card,
        SUM(CASE D2 WHEN 2 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D2) AS D2Card,
        SUM(CASE D3 WHEN 5 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D3) AS D3Card,
        SUM(CASE d4 WHEN 8 THEN 1 ELSE 0 END) / COUNT(DISTINCT D4) AS D4Card,
        SUM(CASE d5 WHEN 2 THEN 1 ELSE 0 END) / COUNT(DISTINCT D5) AS D5Card,
        SUM(CASE d6 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D6) AS D6Card,
        SUM(CASE d7 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D7) AS D7Card,
        SUM(CASE d8 WHEN 3 THEN 1 ELSE 0 END) / COUNT(DISTINCT D8) AS D8Card,
        SUM(CASE d9 WHEN 4 THEN 1 ELSE 0 END) / COUNT(DISTINCT D9) AS D9Card,
        SUM(CASE a WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT A) AS ACard,
        SUM(CASE Hb WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT Hb) AS HbCard
FROM    Mytable

您应该创建一个选择性最少的列（具有最高 *Card 值的列）的列表，这些列（总共）包含超过 25% 您的记录。

比如说，列上的选择性图表如下所示：

Column  Selectivity  Cumulative selectivity
D4      0.96         0.96
D8      0.87         0.84
D9      0.85         0.70
D7      0.72         0.51
D6      0.65         0.33 -- here
D5      0.20         0.07
A       0.02         0.00
Hb      0.01         0.00

这意味着 d4、d8、d9、d7、d6 列上的条件总共匹配大约 33% 的记录。

在这种情况下，不需要将它们用作键列。您应该在其他选择性列上创建索引，并将非选择性列包含到索引中。

CREATE INDEX ix_mytable_filter ON (Hb, A, D5) INLCUDE (D1, D2, D3, D4, D6, D7, D8, D9)

具有非等式过滤器的列始终转到 INCLUDE 部分。

请注意，它只会使用给定的过滤器值改进当前查询。如果您的过滤器是任意的，则需要使用所有相等过滤列作为索引的键。

也可能存在像[D1] <>这样的条件。 8 涉及幻数，并且很少有记录满足此条件。

在这种情况下，您可以将计算列添加到表的定义中：

ALTER TABLE mytable ADD d1_ne_8 AS CASE D1 WHEN 8 THEN 0 ELSE 1 END

并将该表达式添加到索引中（根据上述规则）。

如果这样做，则必须使用 d1_ne_8 = 1 而不是 d1 <> 8..

Extending @gbn's answer.

For a table of this size, you definitely need an index which would cover all columns selected.

However, for each column you should decide whether you want it to be a key column or an included column in the index.

To do this, run this query:

SELECT  SUM(CASE D1 WHEN 8 THEN 0 ELSE 1 END) / COUNT(*) AS D1Card,
        SUM(CASE D2 WHEN 2 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D2) AS D2Card,
        SUM(CASE D3 WHEN 5 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D3) AS D3Card,
        SUM(CASE d4 WHEN 8 THEN 1 ELSE 0 END) / COUNT(DISTINCT D4) AS D4Card,
        SUM(CASE d5 WHEN 2 THEN 1 ELSE 0 END) / COUNT(DISTINCT D5) AS D5Card,
        SUM(CASE d6 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D6) AS D6Card,
        SUM(CASE d7 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D7) AS D7Card,
        SUM(CASE d8 WHEN 3 THEN 1 ELSE 0 END) / COUNT(DISTINCT D8) AS D8Card,
        SUM(CASE d9 WHEN 4 THEN 1 ELSE 0 END) / COUNT(DISTINCT D9) AS D9Card,
        SUM(CASE a WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT A) AS ACard,
        SUM(CASE Hb WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT Hb) AS HbCard
FROM    Mytable

You should create a list of the least selective columns (those with the highest values of *Card) which (together) comprise more than 25% of your records.

Say, the selectivity chart on the columns looks like this:

Column  Selectivity  Cumulative selectivity
D4      0.96         0.96
D8      0.87         0.84
D9      0.85         0.70
D7      0.72         0.51
D6      0.65         0.33 -- here
D5      0.20         0.07
A       0.02         0.00
Hb      0.01         0.00

This means that the conditions on columns d4, d8, d9, d7, d6 together match about 33% of your records.

In this case, there is no need to use them as key columns. You should create an index on the other, selective, columns and include the non-selective ones into the index.

CREATE INDEX ix_mytable_filter ON (Hb, A, D5) INLCUDE (D1, D2, D3, D4, D6, D7, D8, D9)

The columns with the non-equality filter always go to the INCLUDE section.

Note that it will only improve the current query, with the given values of the filters. If your filters are arbitrary, you would need to use all equality filtered columns as the keys of the index.

It may also be case that the conditions like [D1] <> 8 involve magic numbers, and there are few records for which this condition holds.

In this case, you can add a computed column into your table's definition:

ALTER TABLE mytable ADD d1_ne_8 AS CASE D1 WHEN 8 THEN 0 ELSE 1 END

and add this expression to the index (with regard to the rules above).

If you do this, you will have to use d1_ne_8 = 1 instead of d1 <> 8.

回复收藏 0 原文

物价感观 2024-10-21 23:42:21

首先，使用“where X <> 8”等条件的查询可能会使任何索引变得无用（这可能取决于您的实际数据库引擎。

会更安全：

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2 
   AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

将其更改为类似这样的内容

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE ([D1] < 8 or [D1] > 8) 
       AND ([D2] < 2 or [D2] > 2) 
       AND ([D3] < 5 or [D3] > 5) 
       AND [D4] = 8 AND [D5] = 2  AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 
       AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

First of all queries with conditions like "where X <> 8" may make any index useless (it may depend on your actual DB engine.

It is safer to change this from

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2 
   AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

to something more like this:

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE ([D1] < 8 or [D1] > 8) 
       AND ([D2] < 2 or [D2] > 2) 
       AND ([D3] < 5 or [D3] > 5) 
       AND [D4] = 8 AND [D5] = 2  AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 
       AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

回复收藏 0 原文

玉环 2024-10-21 23:42:21

如果您的算法是确定性的（即 A = f(d1, d2, d3...d9)），那么您的 D 列与 Hb 组合构成一个密钥。尝试在所有 D 列和 Hb 上创建聚集复合索引，在 Hb 上进行分区以提高速度。您也可以考虑删除 ID 字段。

编辑：
刚刚意识到我错过了 <>状况。正如其他人提到的，这使事情变得更加困难。这里你真正想要使用的是位图索引，但据我所知 SQL Server 没有它们。您可能需要依赖各个列索引的良好配合。

回复收藏 0 原文

明月夜 2024-10-21 23:42:21

基本上，您应该创建以具有相等检查的列开始的复合索引。因此，在您的情况下，很自然地使用 [Hb] 作为第一个组成部分，因为您声明将平等地检查 [Hb]。索引的下一个元素是 [D*]，后跟 [A]

create index IXC_MyTable1 on Mytable(Hb, D1, D2, D3, D4, D5, D6, D7, D8, D9, A)

第二个想法是，您可以采用部分索引并让数据库对表进行快速索引扫描 (CMIIW) 以检查其他值。在这种情况下，您应该将 Id 包含为索引的最后一项。例如：

create index IXC_MyTable__D123 on Mytable(Hb, D1, D2, D3, Id)
create index IXC_MyTable__D456 on Mytable(Hb, D4, D5, D6, Id)
create index IXC_MyTable__D789 on Mytable(Hb, D7, D8, D9, Id)

当查询对 Hb、D1、D2 和 D3 使用相等性检查时，查询将使用 IXC_MyTable__D123 索引；等等。

Basically, you should create compound index started by column with equality check. So, in your case it is naturally to use [Hb] as the first component since you stated that [Hb] will be checked with equality. The next elements of the index are [D*], followed by [A]

create index IXC_MyTable1 on Mytable(Hb, D1, D2, D3, D4, D5, D6, D7, D8, D9, A)

On the second thought, you can employ a partial indexing and let the db do a fast index scan (CMIIW) to the table to check other values. In this case, you should include Id as the last item of the index. For example:

create index IXC_MyTable__D123 on Mytable(Hb, D1, D2, D3, Id)
create index IXC_MyTable__D456 on Mytable(Hb, D4, D5, D6, Id)
create index IXC_MyTable__D789 on Mytable(Hb, D7, D8, D9, Id)

The query will use IXC_MyTable__D123 index when it uses equality checks on Hb, D1, D2, and D3; and so on.

回复收藏 0 原文

~没有更多了~