Optimizing a range query with GROUP BY
I'm having trouble with a query. Here is the outline:
Table structure:
CREATE TABLE `world` (
`placeRef` int NOT NULL,
`forenameRef` int NOT NULL,
`surnameRef` int NOT NULL,
`incidence` int NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb3;
ALTER TABLE `world`
ADD KEY `surnameRef_forenameRef` (`surnameRef`,`forenameRef`),
ADD KEY `forenameRef_surnameRef` (`forenameRef`,`surnameRef`),
ADD KEY `forenameRef` (`forenameRef`,`placeRef`);
COMMIT;
This table has over 600,000,000 rows, with data like this:
placeRef forenameRef surnameRef incidence
1 1 2 100
2 1 3 600
This represents the number of people with a given forename-surname combination in a place.
I would like to find all the forenames that a surname is attached to, and then find where those forenames occur, with the summed incidence per place. For example: get all the forenames of people who have the surname "Smith"; then take every row with one of those forenames, group by place, and sum the incidence. I can do this with the following query:
SELECT placeRef, SUM( incidence )
FROM world
WHERE forenameRef IN
(
SELECT DISTINCT forenameRef
FROM world
WHERE surnameRef = 214488
)
GROUP BY world.placeRef
However, this query takes about a minute to execute and will take more time if the surname being searched for is common.
The root problem is: performing a range query with a GROUP BY doesn't utilize the full index.
Any suggestions on how the speed could be improved?
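One way to check which indexes the optimizer actually picks for this query is to prefix it with EXPLAIN. This is only a diagnostic sketch; the plan depends on the MySQL version and table statistics, so no output is shown here.

```sql
-- Show the execution plan instead of running the query.
-- In the output, the `key` column names the index chosen for each table,
-- and `Extra` may report "Using temporary" or "Using filesort" when the
-- GROUP BY cannot be resolved from an index.
EXPLAIN
SELECT placeRef, SUM(incidence)
FROM world
WHERE forenameRef IN (
    SELECT DISTINCT forenameRef
    FROM world
    WHERE surnameRef = 214488
)
GROUP BY placeRef;
```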
Comments (2)
In my experience, if your query has a range condition (i.e. any kind of predicate other than = or IS NULL), the column for that condition is the last column in your index that can be used to optimize search, sort, or grouping.
In other words, suppose you have an index on columns (a, b, c).
A query with equality conditions on a and b and an ORDER BY c uses all three columns. It is able to optimize the ORDER BY c because all rows matching the specific values of a and b are by definition tied on those columns, and those matching rows are already in order by c, so the ORDER BY is a no-op.
But with a range condition on b instead, the query only uses columns a, b. The ORDER BY needs to do a filesort, because the matching rows are not in order by c.
A similar effect is true for GROUP BY. With equality conditions on a and b, the query uses a, b for row selection, and it can also optimize a GROUP BY c using the index, because the rows for each distinct value of c are guaranteed to be grouped together in the index. So it can count the rows for each value of c, and when it's done with one group, it is assured there will be no more rows later with that value of c.
But the range condition spoils that. The rows for each value of c are not grouped together; they may be scattered among each of the higher values of b. In this case, MySQL can't optimize the GROUP BY using the index. It must use a temporary table to count the rows per distinct value of c.
MySQL 8.0.13 introduced a new type of optimizer behavior, the Skip Scan Range Access Method. But as far as I know, it only applies to range conditions, not ORDER BY or GROUP BY.
It's still true that if you have a range condition, this spoils the index optimization of ORDER BY and GROUP BY.
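The example queries this answer refers to are missing from this copy. As a sketch of the four cases it describes, assuming a hypothetical table mytable with an index on (a, b, c) (the table name, index name, and literal values are illustrative and not from the original):

```sql
-- Hypothetical table and index, for illustration only.
CREATE TABLE mytable (
  a INT NOT NULL,
  b INT NOT NULL,
  c INT NOT NULL,
  KEY idx_a_b_c (a, b, c)
);

-- Case 1: equality on a and b. Matching index entries are already in c order,
-- so the ORDER BY needs no extra sort.
SELECT * FROM mytable WHERE a = 1 AND b = 2 ORDER BY c;

-- Case 2: range on b. Matching entries are ordered by b first, then c,
-- so the result is not in c order and has to be sorted.
SELECT * FROM mytable WHERE a = 1 AND b > 2 ORDER BY c;

-- Case 3: equality on a and b. Rows for each value of c are adjacent in the
-- index, so each group can be counted in a single pass.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c;

-- Case 4: range on b. The same value of c can reappear under every higher
-- value of b, so the groups are not adjacent in the index.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c;
```

Prefixing each SELECT with EXPLAIN is one way to check whether the Extra column reports "Using filesort" or "Using temporary" for the range cases.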
Unless I don't understand the task, it seems like this works:
Give it a try.
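The query suggested here is missing from this copy. As an assumption only, and not necessarily what this answer proposed, one equivalent rewrite of the question's query moves the IN subquery into a derived table and joins it back to world:

```sql
-- Assumed rewrite, not the answer's original query.
-- The derived table f holds each matching forenameRef exactly once, so every
-- row of world joins at most once and the sums match the IN version.
SELECT w.placeRef, SUM(w.incidence) AS total_incidence
FROM (
    SELECT DISTINCT forenameRef
    FROM world
    WHERE surnameRef = 214488
) AS f
JOIN world AS w ON w.forenameRef = f.forenameRef
GROUP BY w.placeRef;
```

The DISTINCT matters: joining world to itself directly on forenameRef would count a row once per place in which the surname occurs with that forename. Whether this form is faster than the IN version depends on the plan the optimizer chooses, so it is worth comparing both with EXPLAIN.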
It would benefit from a composite index in this order:
Is incidence being updated a lot? If so, leave it off my index.
You should consider moving from MyISAM to InnoDB. It will need a suitable PK, probably
and it will take 2x-3x the disk space.
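The index definition and the primary key mentioned in this answer are missing from this copy. The statements below are a guess at the kind of change being described, not the answerer's actual suggestion: the index name and column order are assumptions, chosen so that incidence sits last and the grouped sum can be read from the index alone.

```sql
-- Assumed covering index: forenameRef for the lookup, placeRef for grouping,
-- incidence so SUM() can be computed without reading the table rows.
-- Leave incidence out of the list if that column is updated frequently.
ALTER TABLE world
  ADD KEY forename_place_incidence (forenameRef, placeRef, incidence);

-- Switch the storage engine. The answer recommends defining a suitable
-- primary key as part of this change; which columns to use is not shown
-- in this copy, so none is added here.
ALTER TABLE world ENGINE=InnoDB;
```

Both statements have to process a table of over 600 million rows, so they may run for a long time and are best tried on a copy first.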