Does it make sense to use an index that will have a low cardinality?
From my understanding, you don't gain much by putting an index on a column that will hold few distinct values.
I have a column that holds a boolean value (actually it's a small int, but I'm using it as a flag), and this column is used in the WHERE clauses of most of my queries.
In a theoretical "average" case, half of the records will have the value 1 and the other half 0.
So, in this scenario, the database engine could avoid a full table scan, but would have to read a lot of rows anyway (total rows / 2).
So, should I index this column?
I'm using MySQL 5, but I'm more interested in a general rationale for why it does or does not make sense to index a column that I know will have low cardinality.
An index can help even on low-cardinality fields in the following cases:

1. When one of the possible values is very infrequent compared to the others and you search for it. For instance, there are very few color-blind women, so this query:

   ```sql
   SELECT  *
   FROM    color_blind_people
   WHERE   gender = 'F'
   ```

   would most probably benefit from an index on gender.

2. When the values tend to be grouped in table order:

   ```sql
   SELECT  *
   FROM    records_from_2008
   WHERE   year = 2010
   LIMIT 1
   ```

   Though there are only 3 distinct years here, records with earlier years were most probably added first, so very many records would have to be scanned before returning the first 2010 record if not for the index.

3. When you need ORDER BY / LIMIT:

   ```sql
   SELECT  *
   FROM    people
   ORDER BY
           gender, id
   LIMIT 1
   ```

   Without the index, a filesort would be required. Though it's somewhat optimized due to the LIMIT, it would still need a full table scan.

4. When the index covers all fields used in the query. The query can then be answered from the index alone, without reading the table rows.

5. When you need DISTINCT on the indexed column (say, the distinct colors in a large table). MySQL will use INDEX FOR GROUP-BY, and if you have few colors, such a query will be instant even with millions of records. This is an example of a scenario where an index on a low-cardinality field is more efficient than one on a high-cardinality field.

Note that if DML performance is not much of an issue, then it's safe to create the index. If the optimizer thinks that the index is inefficient, the index just will not be used.
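The covering-index and DISTINCT cases can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL (the two planners differ, and SQLite has no INDEX FOR GROUP-BY), and the table `tshirts`, its columns `color` and `price`, and the index name are all hypothetical.

```python
import sqlite3

# Hypothetical schema and names; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tshirts (id INTEGER PRIMARY KEY, color TEXT, price INTEGER)")
colors = ["red", "green", "blue"]  # very low cardinality
conn.executemany(
    "INSERT INTO tshirts (color, price) VALUES (?, ?)",
    [(colors[i % 3], 10 + i % 5) for i in range(3000)],
)
conn.execute("CREATE INDEX ix_tshirts_color_price ON tshirts (color, price)")

# Covering case: every column the query touches is stored in the index.
covering_sql = "SELECT SUM(price) FROM tshirts WHERE color = 'red'"
total_red = conn.execute(covering_sql).fetchone()[0]

# DISTINCT case: the answer can come from the index on color alone.
distinct_colors = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT color FROM tshirts")
)

# The last column of each EXPLAIN QUERY PLAN row is a readable description.
covering_plan = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + covering_sql)
)

print(total_red, distinct_colors)
print(covering_plan)
```

On MySQL, the equivalent check would be to run EXPLAIN on the same statements and look at which index, if any, the optimizer reports.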
It might be worth including the boolean field in a composite index. For example, if you have a large table of messages that typically need to be ordered by Date, but you also have a boolean Deleted field, your queries will often filter on Deleted and sort by Date together.
In that case you will definitely benefit from having a composite index on the Deleted and Date fields.
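A minimal sketch of such a composite index, assuming a hypothetical `messages` table with `deleted` and `sent_date` columns (the names are illustrative, not taken from the answer) and using Python's built-in sqlite3 in place of MySQL:

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL, sent_date TEXT NOT NULL)"
)
start = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO messages (id, deleted, sent_date) VALUES (?, ?, ?)",
    [
        # Every fourth message is flagged as deleted; dates grow with the id.
        (i, 1 if i % 4 == 0 else 0, (start + timedelta(days=i)).isoformat())
        for i in range(1, 401)
    ],
)

# Flag first, date second: the equality filter narrows to one half of the
# index, and within that half the entries are already ordered by date.
conn.execute("CREATE INDEX ix_messages_deleted_date ON messages (deleted, sent_date)")

recent_sql = (
    "SELECT id, sent_date FROM messages "
    "WHERE deleted = 0 ORDER BY sent_date DESC LIMIT 10"
)
recent = conn.execute(recent_sql).fetchall()
recent_plan = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + recent_sql)
)

print(recent)
print(recent_plan)
```

The column order is the design choice that matters here: with the flag as the leading column, the low-cardinality field costs almost nothing and the date ordering does the real work.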
When half of the records have the value 1 and the other half 0, there is no point in putting an index on that column. The query optimizer is unlikely to make use of it.
Typically, however, you have a small set of "active" records and an ever-growing set of "inactive" ones. For example, in a bug tracking system you care about active bugs and hardly ever look at the completed and archived ones. For such a case, the trick is to use a "dateInactivated" column that stores the timestamp of when the record was inactivated/deleted. As the name implies, the value is NULL while the record is active; once it is inactivated, you write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows, since each record will have a unique (not strictly speaking) value. The query would have a condition on dateInactivated as part of the predicate, and the index will pull in just the right set of rows that you care about.
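As a rough illustration of the dateInactivated trick, the sketch below assumes the predicate is a NULL test for active records (`dateInactivated IS NULL`), which is an inference from the description rather than something stated in the answer. The `bugs` table and the index name are hypothetical, and Python's built-in sqlite3 stands in for MySQL.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bugs (id INTEGER PRIMARY KEY, title TEXT, dateInactivated TEXT)"
)
rows = []
for i in range(1000):
    # Every 50th bug stays active (NULL); the rest get a distinct timestamp.
    closed_at = None if i % 50 == 0 else "2024-01-01T00:%02d:%02d" % (i // 60, i % 60)
    rows.append(("bug %d" % i, closed_at))
conn.executemany("INSERT INTO bugs (title, dateInactivated) VALUES (?, ?)", rows)
conn.execute("CREATE INDEX ix_bugs_dateInactivated ON bugs (dateInactivated)")

# Assumed predicate: active records are the ones with no inactivation date.
active_sql = "SELECT id, title FROM bugs WHERE dateInactivated IS NULL"
active = conn.execute(active_sql).fetchall()
active_plan = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + active_sql)
)

# COUNT(DISTINCT ...) ignores NULLs, so this counts distinct close timestamps.
distinct_values = conn.execute(
    "SELECT COUNT(DISTINCT dateInactivated) FROM bugs"
).fetchone()[0]

print(len(active), distinct_values)
print(active_plan)
```

Compared with a 0/1 flag, the column has nearly one distinct value per inactive row, while the active rows all share the single NULL entry.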
I usually do a simple "have index" vs. "don't have index" test. In my experience, the biggest gains come on queries that ORDER BY the indexed column. If you do any sorting on that column, indexing will most likely help.
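One way such a with/without test might look is sketched below. It is an assumption-laden toy: the `items` table, the 50/50 flag distribution, and the row count are made up, and Python's built-in sqlite3 stands in for MySQL, so the timings say nothing about a real server. The point is only the shape of the comparison: same data, same query, one copy indexed and one not.

```python
import sqlite3
import timeit

QUERY = "SELECT COUNT(payload) FROM items WHERE flag = 1"

def build(with_index):
    """Create an in-memory table with a 50/50 flag, optionally indexed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, flag INTEGER NOT NULL, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (flag, payload) VALUES (?, ?)",
        [(i % 2, "row %d" % i) for i in range(20000)],
    )
    if with_index:
        conn.execute("CREATE INDEX ix_items_flag ON items (flag)")
    return conn

def run(conn):
    # COUNT(payload) needs a column that is not in the index, so matching
    # rows have to be fetched from the table either way.
    return conn.execute(QUERY).fetchone()[0]

plain, indexed = build(False), build(True)
result_plain, result_indexed = run(plain), run(indexed)

time_plain = timeit.timeit(lambda: run(plain), number=20)
time_indexed = timeit.timeit(lambda: run(indexed), number=20)

print("no index:   %.4f s" % time_plain)
print("with index: %.4f s" % time_indexed)
```

Checking that both copies return the same result before comparing times guards against benchmarking two different queries by accident.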
IMHO it's of limited usefulness. I assume in most cases there are other criteria in your queries, in addition to the flag, that probably help out a lot more.
At 50%, I'd probably do some benchmarking with and without the index and see if it makes much difference.