MySQL tagging question: how do I select items that have been tagged as X, Y, and Z?

Posted 2024-09-10 02:33:44

I'm dealing with a database where items are "tagged" a certain number of times.

item (100k rows)

  • id
  • name
  • other stuff

tag (10k rows)

  • id
  • name

item2tag (1,000,000 rows)

  • item_id
  • tag_id
  • count

I'm looking for the fastest solution to:

Select items that have been tagged as X, Y, and Z (where X, Y, and Z may be given as tag names)?

Here's what I have so far... I'd just like to make sure I'm doing it in the best way possible:

First get the tag_ids from the names:

SELECT id FROM tag WHERE name IN ('X','Y','Z');

Then I filter on those tag_ids, group by item_id, and use HAVING to keep only the items that match all three:

SELECT item_id, COUNT(tag_id)
  FROM item2tag
  WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING COUNT(tag_id)=3;

Then I can just select from item with those ids.

SELECT * FROM item WHERE id IN ([results from prior query])

I have millions of rows in item2tag, with an index on (item_id, tag_id). Is this going to be the fastest solution?

Comments (3)

泅人 2024-09-17 02:33:44

The method you have suggested is probably the most common way to perform the query but might not be the fastest. Using joins can be faster:

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3
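
If the tag ids are not known in advance, the name lookup and the final fetch from item can be folded into the same statement. This is only a sketch: it assumes tag.name is unique (otherwise the scalar subqueries fail with more than one row), and the alias i is just an illustrative name. It has not been timed against the queries in this thread.

```sql
-- Sketch: resolve tag names and return the full item rows in one query.
-- Assumption: tag.name is unique, so each subquery yields at most one id.
SELECT i.*
FROM item i
JOIN item2tag T1 ON T1.item_id = i.id
JOIN item2tag T2 ON T2.item_id = i.id
JOIN item2tag T3 ON T3.item_id = i.id
WHERE T1.tag_id = (SELECT id FROM tag WHERE name = 'X')
  AND T2.tag_id = (SELECT id FROM tag WHERE name = 'Y')
  AND T3.tag_id = (SELECT id FROM tag WHERE name = 'Z');
```

If one of the names does not exist, its subquery returns NULL and the query returns no rows, which matches the "tagged with all three" requirement.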

You should ensure that you have the following indexes:

  • Primary key on (item_id, tag_id)
  • Index on (tag_id).

I performance tested this query against the original in a few different scenarios.

  • For the case where nearly every item in the table is tagged with at least one of the tags being searched for, the original query takes about 5 seconds and the JOIN version takes about 10 seconds, so the JOIN version is roughly twice as slow.
  • For the case where two of the tags occur very frequently and one occurs only very rarely, the original query takes about 0.9 seconds, whereas the JOIN query takes just 0.003 seconds, a considerable performance improvement.
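
Since the winner depends on how the tags are distributed, it may help to look at the plan MySQL chooses for each query on your own data before settling on one. A minimal sketch, using the same example ids 1, 2, and 3 as above:

```sql
-- Show the execution plan for the JOIN version.
EXPLAIN
SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3;

-- Show the execution plan for the GROUP BY / HAVING version.
EXPLAIN
SELECT item_id
FROM item2tag
WHERE tag_id = 1 OR tag_id = 2 OR tag_id = 3
GROUP BY item_id
HAVING COUNT(tag_id) = 3;
```

The key and rows columns of the output indicate which index is used for each table reference and roughly how many rows MySQL expects to examine.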

The SQL I used for the performance test is pasted below. You can run this test yourself, or modify it slightly to test other queries or different scenarios.

Warning: Don't run this script on your production database as it modifies the contents of the item2tag table. Running the script can take a few minutes as it creates a lot of data.

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$

DELIMITER ;

CALL prc_filler(1000000);

CREATE TABLE item2tag (
    item_id INT NOT NULL,
    tag_id INT NOT NULL,
    count INT NOT NULL
);

INSERT INTO item2tag (item_id, tag_id, count)
SELECT  id % 150001, id % 10, 1
FROM    filler;
ALTER TABLE item2tag ADD PRIMARY KEY (item_id, tag_id);
ALTER TABLE item2tag ADD KEY (tag_id);

-- Make tag 3 occur rarely.    
UPDATE item2tag SET tag_id = 10 WHERE tag_id = 3 AND item_id > 0;

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3;

SELECT item_id
FROM item2tag
WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;
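
After loading the data, it can be worth confirming that the tag distribution really looks the way the scenario intends before comparing timings. The statements below are an optional addition, not part of the original test; the column alias n is only illustrative.

```sql
-- Count rows per tag to confirm that tag 3 is rare after the UPDATE.
SELECT tag_id, COUNT(*) AS n
FROM item2tag
GROUP BY tag_id
ORDER BY tag_id;

-- Optional cleanup of the helper objects created by the script.
DROP PROCEDURE IF EXISTS prc_filler;
DROP TABLE IF EXISTS filler;
```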

撧情箌佬 2024-09-17 02:33:44

You would be better off with an index that has tag_id as its first column. Otherwise, finding all items with tag_id 1 requires a full table scan (the same goes for any tag_id, of course).
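
A minimal sketch of adding such an index, assuming none exists yet; the index name idx_item2tag_tag_item is hypothetical. Putting item_id second lets a lookup by tag_id be answered from the index alone. On InnoDB, a secondary index already carries the primary key columns, so if (item_id, tag_id) is the primary key, a plain index on (tag_id) may behave much the same.

```sql
-- Hypothetical index name; tag_id first so lookups by tag can use it.
CREATE INDEX idx_item2tag_tag_item ON item2tag (tag_id, item_id);

-- List the indexes on the table to confirm what exists.
SHOW INDEX FROM item2tag;
```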

没有伤那来痛 2024-09-17 02:33:44

Depending on how many items carry each individual tag, you could first get the list of items tagged with one tag, and then filter it for occurrences of the other tags, like this:

select item_id from item2tag
where item_id in (
    select item_id from item2tag
    where item_id in (
        select item_id from item2tag where tag_id = TID1
    ) and tag_id = TID2
) and tag_id = TID3
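
The same filtering idea can also be written with EXISTS instead of nested IN subqueries. This is an alternative sketch, not the answer's original query: the ids 1, 2, and 3 stand in for the TID placeholders, the aliases a, b, and c are illustrative, and the outer condition is placed on the tag assumed to be rarest (3 here). Whether it performs better depends on the MySQL version and the data.

```sql
-- Start from the tag assumed to be rarest, then check the other two per item.
SELECT a.item_id
FROM item2tag a
WHERE a.tag_id = 3
  AND EXISTS (SELECT 1 FROM item2tag b
              WHERE b.item_id = a.item_id AND b.tag_id = 1)
  AND EXISTS (SELECT 1 FROM item2tag c
              WHERE c.item_id = a.item_id AND c.tag_id = 2);
```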