Optimizing MySQL GROUP BY/ORDER BY for computing set intersections

Published 2024-10-11 16:42


This scenario is simplified somewhat to make the problem clearer.
My situation involves a set of data records in MySQL.

CREATE TABLE `records` (                                          
  `id` bigint(20) NOT NULL,                                                       
  `property1` bigint(20) NOT NULL,
  `property2` bigint(20) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `property1` (`property1`),
  KEY `property2` (`property2`)
);

From each record, we generate and store a variable number of keys (hashes) based on the record data.

CREATE TABLE `rkeys` (
  `rKey` bigint(20) NOT NULL,
  `rId` bigint(20) NOT NULL,
  KEY `rKey` (`rKey`),
  KEY `rId` (`rId`),
  FOREIGN KEY (`rId`) REFERENCES `records` (`id`)
);

(The key values are hashes to distribute them over the keyspace more evenly.)

There may be, for example, 5 million records and 50 million keys.

What I'm attempting to do is a fuzzy search on the key set -- match a record against the records in the database with the most keys in common. The results also need to be filtered against the properties in the records table.

The query I've been working from looks like this:

SELECT rkeys.rId, records.property1, SUM(1) as score 
FROM rkeys, records
WHERE 
   (rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14) AND 
    rkeys.rId = records.id AND 
    records.property1 = 1 AND
    records.property2 = 2 
GROUP BY rId ORDER BY score DESC;
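
(As an aside, the OR chain can also be written as an IN list with an explicit JOIN; the two forms should be equivalent to the optimizer and are shown here only for readability:)

```sql
SELECT k.rId, r.property1, SUM(1) AS score
FROM rkeys AS k
JOIN records AS r ON r.id = k.rId
WHERE k.rKey IN (10, 11, 13, 14)
  AND r.property1 = 1
  AND r.property2 = 2
GROUP BY k.rId
ORDER BY score DESC;
```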

The performance is ok if the number of records with any given key is fairly small; the problem is if I hit a key that appears in several thousand records (say 5000). All of a sudden, the GROUP BY/ORDER BY performance falls off a cliff (15-20s per query). Note that smoothing out the key distribution is not really an option -- the record data itself is unevenly distributed.

The join against the records table doesn't seem to be the core of the problem -- I'm just including it for context. I still see the same problem if all I want to do is this:

SELECT rId, SUM(1) as score 
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC;

EXPLAIN output:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: rkeys
         type: index
possible_keys: rKey
          key: rKey
      key_len: 8
          ref: NULL
         rows: 1
        Extra: Using where; Using temporary; Using filesort

Is there a way I can restructure this table or query in order to speed this operation up?


Comments (4)

红ご颜醉 2024-10-18 16:42:00

Have you tried adding nonclustered indices (indexes) to those fields? I haven't seen Keys do this automatically in the past, aside from some implicit clustered index creation that primary key declaration does in some SQL engines.

梦在深巷 2024-10-18 16:42:00

I'm a noob, but try a compound index on (rKey, rId) or (rId, rKey)

一笔一画续写前缘 2024-10-18 16:42:00

try something like this

SELECT rId, Count(*) as score 
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC

and add an index to the table

(rKey,rId)

Though replacing SUM with COUNT shouldn't make too much difference (in MSSQL, anyway).
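
The suggested index could be added like this (a sketch; the index name is arbitrary):

```sql
ALTER TABLE rkeys ADD INDEX rkey_rid (rKey, rId);
```

With (rKey, rId), the simplified query is covered entirely by the index, so no table rows need to be read, and rows within each rKey range arrive already ordered by rId.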

夕嗳→ 2024-10-18 16:42:00

"The key values are hashes to distribute them over the keyspace more evenly" is actually a bad idea for performance. Once the data gets too large to be cached, you will be slowed down by the randomness.

The table Keys smells a lot like a many:many mapping table. Here are several tips on improving the performance of such a table. And it is likely to speed up your SELECT.

Your SELECT should improve significantly with this 'composite' and 'covering' index:

INDEX(property1, property2, id)

(More tips in creating optimal indexes.)
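
Concretely, the mapping-table advice usually amounts to something like the following (a sketch, assuming InnoDB; table and index names are illustrative):

```sql
-- Rebuild rkeys as a classic many-to-many mapping table:
-- the composite PRIMARY KEY clusters rows by rKey, and the
-- secondary index covers lookups in the other direction.
CREATE TABLE rkeys2 (
  rKey bigint NOT NULL,
  rId  bigint NOT NULL,
  PRIMARY KEY (rKey, rId),
  KEY rid_rkey (rId, rKey),
  FOREIGN KEY (rId) REFERENCES records (id)
) ENGINE=InnoDB;

-- Covering index for the property filter on records:
ALTER TABLE records ADD INDEX p1_p2_id (property1, property2, id);
```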
