在大型数据库中为简单查询建立索引
我得到一个包含近 850,000,000 行的表。
该表具有以下字段:
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,
我对此表的所有查询都完全相同 -
选择 [D1-D9], [A] 其中 [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z,
等...
每个查询将始终查询所有 [D1-D9] 字段并始终要求 [Hb] = 0
查询示例:
SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A]
from [myTable]
WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2
AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0
我应该如何为此表建立索引最快的结果?
多谢
I got a table contains nearly 850,000,000 rows.
The table has the following fields:
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,
All my queries for this table are quite the same -
Select [D1-D9], [A] Where [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z,
etc....
Each query will ALWAYS query ALL [D1-D9] fields and always ask for [Hb] = 0
Example for a query:
SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A]
from [myTable]
WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2
AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0
How should I index this table for the fastest results?
Thanks a lot
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
您能做的最好的事情就是让索引首先进行相等性检查,然后进行剩余的不相等性查找。即,
=
位于<>
之前。重新排列 WHERE 子句:
因此,初稿是这样的:
D4
到D9
的顺序应基于选择性。数字最大的优先。Hb
应该始终在等式列中排在最后,因为它是位最后,这可以是集群或非集群。如果您没有其他索引或没有 FK,那么我会考虑将其设为集群 PK。否则,只需创建一个聚集代理键并使该索引为非聚集
编辑:
一篇文章(希望)解释了为什么列顺序对于多列索引很重要:Craig Freedman 的搜索谓词。还有他的扫描和搜索
编辑2:
我询问
<>
之前的=
是否在同一列上:显示“是”。OP对此答案的评论说“不”,所以我在这里所说的一切都是毫无意义的。Damien_The_Unbelier
建议的索引交叉点的答案是尝试绕过这个平等/非平等的混合。
The best you can do is have your index do equality checks first followed by a residual non-equality lookup. That is,
=
before<>
.Rearranging the WHERE clause:
So, first draft is this:
The order of
D4
toD9
should be based on selectivity. Highest numbers first.Hb
should always go last in the equality columns because it's bitFinally, this can be clustered or non-clustered. If you have no other indexes or no FKs then I'd consider make this the clustered PK. Otherwise, just create a clustered surrogate key and make this index NONCLUSTERED
Edit:
An article that (hopefully) explains why column order matters for mulitple column indexes: Craig Freedman's Seek Predicates. And his Scans and Seeks too
Edit2:
I asked if the
=
before<>
are on the same columns: it appeared "yes".OP's comment to this answer says "no" so everything I've said here is pointless
The answer from Damien_The_Unbeliever's suggested index intersections to try and get around this the equality/nonequality mix.
您可能会发现(如果每个查询中的十列的各个相等/不等式测试不同),您能做的最好的事情就是在每个列上单独构建一个窄索引,然后希望优化器将应用 索引交集,它将在其中在有意义的地方使用每列上的索引。
You may find (if the individual equality/inequality tests are different for the ten columns in each query) that the best you can do is build a narrow index on each column individually, and then hope that the optimizer will apply index intersection, where it will use the indexes on each column where it makes sense to do so.
扩展@gbn的答案。
对于这种大小的表,您肯定需要一个覆盖所有选定列的索引。
但是,对于每一列,您应该决定是否希望它成为索引中的键列或包含列。
为此,请运行以下查询:
您应该创建一个选择性最少的列(具有最高
*Card
值的列)的列表,这些列(总共)包含超过25%
您的记录。比如说,列上的选择性图表如下所示:
这意味着
d4、d8、d9、d7、d6
列上的条件总共匹配大约33%
的记录。在这种情况下,不需要将它们用作键列。您应该在其他选择性列上创建索引,并将非选择性列包含到索引中。
具有非等式过滤器的列始终转到
INCLUDE
部分。请注意,它只会使用给定的过滤器值改进当前查询。如果您的过滤器是任意的,则需要使用所有相等过滤列作为索引的键。
也可能存在像
[D1] <>这样的条件。 8
涉及幻数,并且很少有记录满足此条件。在这种情况下,您可以将计算列添加到表的定义中:
并将该表达式添加到索引中(根据上述规则)。
如果这样做,则必须使用
d1_ne_8 = 1
而不是d1 <> 8.
.Extending @gbn's answer.
For a table of this size, you definitely need an index which would cover all columns selected.
However, for each column you should decide whether you want it to be a key column or an included column in the index.
To do this, run this query:
You should create a list of the least selective columns (those with the highest values of
*Card
) which (together) comprise more than25%
of your records.Say, the selectivity chart on the columns looks like this:
This means that the conditions on columns
d4, d8, d9, d7, d6
together match about33%
of your records.In this case, there is no need to use them as key columns. You should create an index on the other, selective, columns and include the non-selective ones into the index.
The columns with the non-equality filter always go to the
INCLUDE
section.Note that it will only improve the current query, with the given values of the filters. If your filters are arbitrary, you would need to use all equality filtered columns as the keys of the index.
It may also be case that the conditions like
[D1] <> 8
involve magic numbers, and there are few records for which this condition holds.In this case, you can add a computed column into your table's definition:
and add this expression to the index (with regard to the rules above).
If you do this, you will have to use
d1_ne_8 = 1
instead ofd1 <> 8
.首先,使用“where X <> 8”等条件的查询可能会使任何索引变得无用(这可能取决于您的实际数据库引擎。
会更安全:
将其更改为类似这样的内容
First of all queries with conditions like "where X <> 8" may make any index useless (it may depend on your actual DB engine.
It is safer to change this from
to something more like this:
如果您的算法是确定性的(即 A = f(d1, d2, d3...d9)),那么您的 D 列与 Hb 组合构成一个密钥。尝试在所有 D 列和 Hb 上创建聚集复合索引,在 Hb 上进行分区以提高速度。您也可以考虑删除 ID 字段。
编辑:
刚刚意识到我错过了 <>状况。正如其他人提到的,这使事情变得更加困难。这里你真正想要使用的是位图索引,但据我所知 SQL Server 没有它们。您可能需要依赖各个列索引的良好配合。
If your algorithm is deterministic (i.e. A = f(d1, d2, d3...d9)) then your D columns combined with Hb constitute a key. Try creating a clustered composite index on all D columns and Hb, partitioning on Hb for a little speed boost. You may also consider dropping the ID field.
EDIT:
Just realised I'd missed the <> conditions. As others have mentioned this makes things much harder. What you really want to use here is a bitmap index, but AFAIK SQL Server doesn't have them. You will probably need to rely on individual column indexes playing nicely together.
基本上,您应该创建以具有相等检查的列开始的复合索引。因此,在您的情况下,很自然地使用 [Hb] 作为第一个组成部分,因为您声明将平等地检查 [Hb]。索引的下一个元素是 [D*],后跟 [A]
第二个想法是,您可以采用部分索引并让数据库对表进行快速索引扫描 (CMIIW) 以检查其他值。在这种情况下,您应该将 Id 包含为索引的最后一项。例如:
当查询对 Hb、D1、D2 和 D3 使用相等性检查时,查询将使用 IXC_MyTable__D123 索引;等等。
Basically, you should create compound index started by column with equality check. So, in your case it is naturally to use [Hb] as the first component since you stated that [Hb] will be checked with equality. The next elements of the index are [D*], followed by [A]
On the second thought, you can employ a partial indexing and let the db do a fast index scan (CMIIW) to the table to check other values. In this case, you should include Id as the last item of the index. For example:
The query will use IXC_MyTable__D123 index when it uses equality checks on Hb, D1, D2, and D3; and so on.