How does SQL Server determine whether a table column has low cardinality?
I ask because the query optimizer would most likely not use an index on a gender column (values 'm' and 'f'). But how does it determine the cardinality of the gender column in order to reach that decision?
On top of this, in the unlikely event that I had a million rows in my table and only one row in the gender column was 'm', would SQL Server be able to determine this and use the index to retrieve that single row? Or would it just know that there are only 2 distinct values in the column and not use the index?
I appreciate that the above describes some poor database design, but I'm just trying to understand how the query optimizer comes to its decisions.
Many thanks.
Comments (1)
See "Statistics Used by the Query Optimizer in Microsoft SQL Server 2005".
With 1 row of 'm' and 999,999 rows of 'f', the statistics will give a cardinality estimate of 1 for 'm' and something close to 1M for 'f'. Whether the index will actually be used, though, depends on more factors than that.
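To make the estimate concrete, here is a small Python sketch of the idea. It is a toy model rather than SQL Server's actual implementation: the names `histogram`, `estimate_rows` and `average_rows_per_value` are made up for illustration, and a real statistics histogram holds a limited number of steps rather than one exact count per value.

```python
from collections import Counter

# Toy stand-in for column statistics: one exact row count per distinct value.
# (Hypothetical structure; real statistics are sampled and bucketed.)
histogram = Counter({"f": 999_999, "m": 1})
total_rows = sum(histogram.values())


def estimate_rows(histogram, value):
    """Estimated rows for `column = value`, read straight from the histogram."""
    return histogram.get(value, 0)


def average_rows_per_value(histogram):
    """Rows divided by distinct values: all the optimizer could know
    if it had only the number of distinct values and no histogram."""
    return sum(histogram.values()) / len(histogram)


print(estimate_rows(histogram, "m"))
print(estimate_rows(histogram, "f"))
print(average_rows_per_value(histogram))
```

The per-value estimates differ by six orders of magnitude, while the distinct-count-only figure is the same 500,000 for both values. That gap is why a histogram lets the optimizer treat 'm' and 'f' differently instead of only knowing that the column has 2 distinct values.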
In general, an index on such a low-selectivity column alone does not make sense. However, the column does make sense as the leftmost column of a more complex index, and even as the leftmost column of the clustered index. And even if an index would make sense for 'm' and not for 'f', query auto-parameterization may play a trick on you and generate a plan for a variable @gender instead.
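The parameterization trap can be sketched with the same kind of toy model. The assumptions here are mine, not the answer's: that an unknown value at compile time is estimated as the average rows per distinct value, and that the optimizer picks a seek only below a made-up cutoff of 1% of the table (`seek_threshold`). Real plan costing is far more involved.

```python
def estimate_rows(histogram, value=None):
    """Estimated rows for `column = value`.

    With value=None the value is unknown at compile time (as with a
    variable such as @gender), so fall back on the per-value average.
    """
    if value is None:
        return sum(histogram.values()) / len(histogram)
    return histogram.get(value, 0)


def choose_plan(estimated_rows, total_rows, seek_threshold=0.01):
    """Hypothetical rule: seek only if few enough rows are expected."""
    if estimated_rows <= seek_threshold * total_rows:
        return "index seek"
    return "scan"


histogram = {"f": 999_999, "m": 1}
total_rows = sum(histogram.values())

plans = {
    "literal 'm'": choose_plan(estimate_rows(histogram, "m"), total_rows),
    "literal 'f'": choose_plan(estimate_rows(histogram, "f"), total_rows),
    "variable @gender": choose_plan(estimate_rows(histogram), total_rows),
}

for label, plan in plans.items():
    print(f"{label}: {plan}")
```

In this model the literal 'm' gets an index seek, but the plan built for the variable is a scan, and that plan is what runs even when @gender happens to be 'm' at execution time.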
You'll have to either read more or give more details. Some good resources are the blogs of the QO team and its members: