优化计数查询还是重构设计?

发布于 2025-01-13 01:56:22 字数 1880 浏览 5 评论 0原文

我正在尝试按范围日期(从/到日期)为用户查看/点击的产品(大约 30k 个将在数据库中重复记录的产品)创建报告系统。

每次用户单击产品时,我都会在数据库中的单行中记录产品 ID 和日期。当我必须选择并显示报告时,我的问题就来了,因为表格在 2 个月内迅速增长到 400 万个,而且我必须保留长达 6 个月的记录。

我的问题是有没有更好的方法来优化查询或记录它们的方式?

数据库表

CREATE TABLE `product_view` (
    `id` int(11) NOT NULL,
    `product_id` int(11) NOT NULL,
    `date_create` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

ALTER TABLE `product_view`
    ADD PRIMARY KEY (`id`),
    ADD KEY `product_id` (`product_id`) USING BTREE;

我的选择查询没有范围日期需要大约 50 秒才能提取结果

SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
    LEFT JOIN product p ON p.product_id = pv.product_id
    LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
GROUP BY pv.product_id
ORDER BY total
DESC LIMIT 0,20

查询示例

id  select_type     table   type      possible_keys     key           key_len      ref                      rows        Extra
1   SIMPLE           pv     range   product_id           product_id     4         NULL                      1647717     Using where; Using index; Using temporary; Using filesort
1   SIMPLE           p     eq_ref   PRIMARY              PRIMARY        4         test.pv.product_id        1
1   SIMPLE           pd     ref     PRIMARY,product_id   PRIMARY        4         test.pv.product_id        1

查询有范围日期

SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
    LEFT JOIN product p ON p.product_id = pv.product_id
    LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
    AND DATE(pv.date_create) >= '2021-07-25'
    AND DATE(pv.date_create) <= '2022-03-10'
GROUP BY pv.product_id
ORDER BY total DESC LIMIT 0,20

Im trying to do report system by range date (from/to date) for products (around 30k products that will repetitive record in DB) viewed/clicked by users.

Each time user click on product i record product_id and date on single row in Database. My problem is coming when i have to select and display report because table quickly grow up to 4 millions in 2 months and i have to keep record up to 6 months.

My question is there any better way to optimize the query or the way i record them?

DB Table

CREATE TABLE `product_view` (
    `id` int(11) NOT NULL,
    `product_id` int(11) NOT NULL,
    `date_create` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

ALTER TABLE `product_view`
    ADD PRIMARY KEY (`id`),
    ADD KEY `product_id` (`product_id`) USING BTREE;

My select query without range date take around 50 seconds to pull results

SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
    LEFT JOIN product p ON p.product_id = pv.product_id
    LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
GROUP BY pv.product_id
ORDER BY total
DESC LIMIT 0,20

Query EXAMPlE

id  select_type     table   type      possible_keys     key           key_len      ref                      rows        Extra
1   SIMPLE           pv     range   product_id           product_id     4         NULL                      1647717     Using where; Using index; Using temporary; Using filesort
1   SIMPLE           p     eq_ref   PRIMARY              PRIMARY        4         test.pv.product_id        1
1   SIMPLE           pd     ref     PRIMARY,product_id   PRIMARY        4         test.pv.product_id        1

Query with range date

SELECT SQL_NO_CACHE pd.name, pv.product_id, p.model, COUNT(pv.id) as total
FROM product_view pv
    LEFT JOIN product p ON p.product_id = pv.product_id
    LEFT JOIN product_description pd ON pd.product_id = pv.product_id
WHERE pv.product_id > 0
    AND DATE(pv.date_create) >= '2021-07-25'
    AND DATE(pv.date_create) <= '2022-03-10'
GROUP BY pv.product_id
ORDER BY total DESC LIMIT 0,20

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

舂唻埖巳落 2025-01-20 01:56:22

尝试将查询重写为相关查询:

select p.product_id, p.model, pd.name, (
    select count(*)
    from product_view as pv
    where pv.product_id = p.product_id
    and pv.date_create >= '2021-07-25'
    and pv.date_create <  '2022-03-10' + interval 1 day
) as total
from product as p
left join product_description as pd on p.product_id = pd.product_id
where exists (
    select 1
    from product_view as pv
    where pv.product_id = p.product_id
    and pv.date_create >= '2021-07-25'
    and pv.date_create <  '2022-03-10' + interval 1 day
    -- this is a far more optimized version for dates used in your op
)
order by total desc
limit 0, 20

这不涉及分组,因此它应该比原始查询更快。如果不需要日期过滤器,则从计数子查询中删除 where isn't 部分以及 and pv.date_create ...

其次,我在解释中没有看到任何有用的索引。您应该尝试以下索引:

create index ix1 on product_view (product_id, date_create)
-- should be (i) good for joining (ii) "covers" the date column

Try rewriting the query as a correlated query:

select p.product_id, p.model, pd.name, (
    select count(*)
    from product_view as pv
    where pv.product_id = p.product_id
    and pv.date_create >= '2021-07-25'
    and pv.date_create <  '2022-03-10' + interval 1 day
) as total
from product as p
left join product_description as pd on p.product_id = pd.product_id
where exists (
    select 1
    from product_view as pv
    where pv.product_id = p.product_id
    and pv.date_create >= '2021-07-25'
    and pv.date_create <  '2022-03-10' + interval 1 day
    -- this is a far more optimized version for dates used in your op
)
order by total desc
limit 0, 20

This does not involve grouping so it should be faster than your original query. If date filter is not required then remove the where exists part and and pv.date_create ... from the count sub-query.

Secondly, I don't see any useful indexes in the explain. You should try the following indexes:

create index ix1 on product_view (product_id, date_create)
-- should be (i) good for joining (ii) "covers" the date column
﹂绝世的画 2025-01-20 01:56:22
CREATE TABLE `product_view` (
-- toss, as useless:  `id` int(11) NOT NULL,
`product_id` int(11) NOT NULL,
`date_create` datetime NOT NULL,
PRIMARY KEY(product_id, date_create)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

然后 PARTITION BY RANGE(TO_DAYS(date_create)) 使删除更加高效。我建议每周进行 30 次左右的分区。请参阅分区

COUNT(pv.id)——通常的模式就是COUNT(*)。对于 id,它会检查该 id 是否为 NOT NULL,这是不必要的。

pv.product_id> 0 -- ids <=0 有什么特别之处吗?

让我们重新排列查询,以“开始”计数:

SELECT pd.name, pv.product_id, p.model, s.total
    FROM ( SELECT pv.product_id, COUNT(*) AS total
             FROM product_view AS pv
             WHERE pv.date_create >= '2021-07-25'
         ) AS s
    JOIN product AS p  ON p.product_id = pv.product_id
    ORDER BY total DESC
    LIMIT 0, 20

注意:

  • 去掉 DATE(),它不是“可控制的”并且会阻止使用任何索引。
  • 如果您希望计数到昨天,则添加
    AND pv.date_create < CURDATE()
  • LEFT 暗示“右”行可能丢失;我怀疑情况并非如此。
  • 我摆脱了pd,因为它没有被使用(如果不使用它,它的成本会很高)。
  • SQL_NO_CACHE 将在 8.0 中消失;您现在也可以关闭查询缓存。
  • 如果日期范围回溯大部分数据,则将扫描整个表,因此我在该领域对性能没有多大帮助。所以...
  • 构建并维护一个汇总表(dy,product_id,小计);然后对其进行查询。请参阅汇总表
CREATE TABLE `product_view` (
-- toss, as useless:  `id` int(11) NOT NULL,
`product_id` int(11) NOT NULL,
`date_create` datetime NOT NULL,
PRIMARY KEY(product_id, date_create)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then PARTITION BY RANGE(TO_DAYS(date_create)) to make the deletion much more efficient. I recommend about 30 weekly partitions. See Partition .

COUNT(pv.id) -- The usual pattern is simply COUNT(*). With an id, it checks that id for being NOT NULL, which is unnecessary.

pv.product_id > 0 -- is there something special about ids <=0?

Let's rearrange the query to "start" with the count:

SELECT pd.name, pv.product_id, p.model, s.total
    FROM ( SELECT pv.product_id, COUNT(*) AS total
             FROM product_view AS pv
             WHERE pv.date_create >= '2021-07-25'
         ) AS s
    JOIN product AS p  ON p.product_id = pv.product_id
    ORDER BY total DESC
    LIMIT 0, 20

Note:

  • Get rid of DATE(), it is not "sargable" and prevents the use of any index.
  • If you want the count to go through yesterday, then add
    AND pv.date_create < CURDATE()
  • LEFT implies that the 'right' row may be missing; I suspect that is not the case.
  • I got rid of pd since it is not used (and costs a lot to have if it won't be used).
  • SQL_NO_CACHE is going away in 8.0; you may as well turn off the Query cache now.
  • If the date range goes back through most of the data, the entire table will be scanned, so I have not helped performance much in that area. So...
  • Build and maintain a Summary Table with (dy, product_id, subtotal); then do the query against it. See Summary Tables
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文