Does it make sense to use an index that will have low cardinality?

Posted on 2024-08-18 21:53:21

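As I understand it, you don't gain much by putting an index on a column that holds very few distinct values.

I have a column that holds a boolean value (it is actually a small int, but I use it as a flag), and this column is used in the WHERE clause of most of my queries. In a theoretical "average" case, half of the records will have the value 1 and the other half 0.

So in this scenario the database engine could avoid a full table scan, but it would still have to read a lot of rows (total rows / 2).

So, should I index this column?

I'm using MySQL 5, but I'm more interested in the general rationale for why it does or doesn't make sense to index a column that I know has low cardinality.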

一向肩并 2024-08-25 21:53:21

An index can help even on low-cardinality fields in the following cases:

When one of the possible values is very infrequent compared to the others and you search for it.
For instance, there are very few color-blind women, so this query:
```
SELECT  *
FROM    color_blind_people
WHERE   gender = 'F'
```
would most probably benefit from an index on gender.
When the values tend to be grouped in the table order:
```
SELECT  *
FROM    records_from_2008
WHERE   year = 2010
LIMIT 1
```
Though there are only 3 distinct years here, records with earlier years were most probably added first, so without the index very many records would have to be scanned before the first 2010 record is returned.
When you need ORDER BY / LIMIT:
```
SELECT  *
FROM    people
ORDER BY
        gender, id
LIMIT 1
```
Without the index, a filesort would be required. Though it's somewhat optimized due to the LIMIT, it would still need a full table scan.

When the index covers all fields used in the query:

CREATE INDEX ix_mytable_lcr_value ON mytable (low_cardinality_record, value)

SELECT  SUM(value)
FROM    mytable
WHERE   low_cardinality_record = 3

When you need DISTINCT:
```
SELECT  DISTINCT color
FROM    tshirts
```
MySQL will use a loose index scan (reported as "Using index for group-by" in EXPLAIN), and if you have few colors, this query will be instant even with millions of records.
This is an example of a scenario where an index on a low-cardinality field is more efficient than one on a high-cardinality field.

Note that if DML performance is not much of an issue, then it's safe to create the index.

If the optimizer thinks that the index is inefficient, it simply will not use it.
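
The covering-index case can be tried out without a MySQL server. The following is only a rough sketch that uses SQLite through Python's standard `sqlite3` module as a stand-in, so the planner and the plan wording differ from MySQL's; the index name `ix_mytable_lcr_value` and the sample data are assumptions made up for illustration.

```python
import sqlite3


def covering_index_demo():
    """Build a small table, index it, and return (plan_details, total)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable ("
        "id INTEGER PRIMARY KEY, "
        "low_cardinality_record INTEGER, "
        "value INTEGER)"
    )
    # Only five distinct values (0..4) land in the indexed column.
    conn.executemany(
        "INSERT INTO mytable (low_cardinality_record, value) VALUES (?, ?)",
        [(i % 5, i) for i in range(10_000)],
    )
    # Hypothetical index name; it holds both columns the query touches.
    conn.execute(
        "CREATE INDEX ix_mytable_lcr_value "
        "ON mytable (low_cardinality_record, value)"
    )
    query = "SELECT SUM(value) FROM mytable WHERE low_cardinality_record = 3"
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    plan_details = [
        row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)
    ]
    total = conn.execute(query).fetchone()[0]
    conn.close()
    return plan_details, total


if __name__ == "__main__":
    details, total = covering_index_demo()
    print(details)
    print(total)
```

In SQLite the plan detail should mention a covering index, meaning the query is answered from the index alone without visiting the table rows. On MySQL the comparable check would be running EXPLAIN on the same query and looking for "Using index" in the Extra column.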

昇り龍 2024-08-25 21:53:21

It might be worth including the boolean field in a composite index. For example, if you have a large table of messages that typically need to be ordered by Date, and you also have a boolean Deleted field, you often query it like this:

SELECT ... FROM Messages WHERE Deleted = 0 AND Date BETWEEN @start AND @end

You will definitely benefit from having a composite index on the Deleted and Date fields.
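
As a rough illustration of the composite-index idea, here is a sketch that again uses SQLite via Python's `sqlite3` module rather than MySQL, so the plan output is not what MySQL would print. The index name `ix_messages_deleted_date`, the extra `Id` and `Body` columns, and the sample rows are assumptions for the example; the flag is placed first in the index because it is matched by equality, with the range column after it.

```python
import sqlite3


def composite_index_demo(start, end):
    """Return (plan_details, rows) for the Deleted + Date range query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Messages ("
        "Id INTEGER PRIMARY KEY, "
        "Deleted INTEGER NOT NULL, "
        "Date TEXT NOT NULL, "
        "Body TEXT)"
    )
    # Hypothetical index: equality column first, range column second.
    conn.execute(
        "CREATE INDEX ix_messages_deleted_date ON Messages (Deleted, Date)"
    )
    # Alternate the flag and spread the dates over January 2024.
    conn.executemany(
        "INSERT INTO Messages (Deleted, Date, Body) VALUES (?, ?, ?)",
        [
            (i % 2, f"2024-01-{(i % 28) + 1:02d}", f"message {i}")
            for i in range(1000)
        ],
    )
    query = (
        "SELECT Id, Body FROM Messages "
        "WHERE Deleted = 0 AND Date BETWEEN ? AND ?"
    )
    plan_details = [
        row[-1]
        for row in conn.execute("EXPLAIN QUERY PLAN " + query, (start, end))
    ]
    rows = conn.execute(query, (start, end)).fetchall()
    conn.close()
    return plan_details, rows


if __name__ == "__main__":
    details, rows = composite_index_demo("2024-01-10", "2024-01-12")
    print(details)
    print(len(rows))
```

The plan detail should name the composite index and show constraints on both Deleted and Date, so the low-cardinality flag narrows the search even though an index on the flag alone would not be very selective.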

一束光，穿透我孤独的魂 2024-08-25 21:53:21

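When half of the records have the value 1 and the other half 0, there is no point in putting an index on that column. The query optimizer is unlikely to use it.

Typically, however, you have a small set of "active" records and an ever-growing set of "inactive" ones. For example, in a bug tracking system you care about active bugs and hardly ever look at the completed and archived ones. For such a case, the trick is to use a "dateInactivated" column that stores the timestamp of when the record was inactivated/deleted. As the name implies, the value is NULL while the record is active; once the record is inactivated, the system datetime is written to it. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows, since each record will have a unique (not strictly speaking) value. The query would have

"... AND dateInactivated is NULL ..."

as part of its predicate, and the index would pull in just the set of rows you care about.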
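
A minimal sketch of the dateInactivated trick, run against SQLite through Python's `sqlite3` module as a stand-in for MySQL (both index NULL values, but their planners differ). The table name `bugs`, the index name `ix_bugs_inactivated`, and the generated timestamps are assumptions made up for this example.

```python
import sqlite3


def active_bug_demo():
    """Return (plan_details, active_count, distinct_timestamps)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bugs ("
        "id INTEGER PRIMARY KEY, "
        "title TEXT, "
        "dateInactivated TEXT)"
    )
    conn.execute(
        "CREATE INDEX ix_bugs_inactivated ON bugs (dateInactivated)"
    )
    rows = []
    for i in range(1000):
        # Every 100th bug stays active (NULL); the rest get their own timestamp.
        if i % 100 == 0:
            stamp = None
        else:
            stamp = f"2024-01-01 00:{i // 60:02d}:{i % 60:02d}"
        rows.append((f"bug {i}", stamp))
    conn.executemany(
        "INSERT INTO bugs (title, dateInactivated) VALUES (?, ?)", rows
    )
    query = "SELECT id, title FROM bugs WHERE dateInactivated IS NULL"
    plan_details = [
        row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)
    ]
    active_count = len(conn.execute(query).fetchall())
    # COUNT(DISTINCT ...) ignores NULLs, so this counts inactive timestamps only.
    distinct_timestamps = conn.execute(
        "SELECT COUNT(DISTINCT dateInactivated) FROM bugs"
    ).fetchone()[0]
    conn.close()
    return plan_details, active_count, distinct_timestamps


if __name__ == "__main__":
    details, active, distinct = active_bug_demo()
    print(details)
    print(active, distinct)
```

Compared with a 0/1 flag, the indexed column here holds almost as many distinct values as there are inactive rows, while the active rows all sit together under NULL where the index can find them directly.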
