MySQL: rewriting an MSSQL query that uses a correlated subquery in the FROM clause?
We have a table that contains website page views, like:
time | page_id
----------|-----------------------------
1256645862| pageA
1256645889| pageB
1256647199| pageA
1256647198| pageA
1256647300| pageB
1257863235| pageA
1257863236| pageC
In our production table, there are currently about 40K rows. We want to generate, for each day, the count of unique pages viewed in the last 30 days, 60 days, and 90 days. So, in the result set, we can look up a day and see how many unique pages were accessed within the 60-day period preceding that day.
We were able to get a query to work in MSSQL:
SELECT DISTINCT
CONVERT(VARCHAR,P.NDATE,101) AS 'DATE',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-29,P.NDATE) AND P.NDATE) AS SUB) AS '30D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-59,P.NDATE) AND P.NDATE) AS SUB) AS '60D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-89,P.NDATE) AND P.NDATE) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'
NOTE: because MSSQL doesn't have the FROM_UNIXTIME function, we added the NDATE column for testing, which is just the converted time. NDATE does not exist in the production table.
Converting this query to MySQL gives us the "Unknown column P.time" error:
SELECT DISTINCT
FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '30D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '60D',
(SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'
I understand this is because a subquery in the FROM clause (a derived table) cannot reference a table from the outer query's FROM clause. Unfortunately, we are at a loss on how to convert this query to work in MySQL. For now, we simply return all DISTINCT rows from the table and post-process them in PHP. This takes about 2-3 seconds for 40K rows. I'm worried about the performance when we have hundreds of thousands of rows.
Is it possible to do this in MySQL? If so, can we expect it to perform better than our PHP post-processed solution?
UPDATE:
Here's the query for creating the table:
CREATE TABLE `perflog` (
`user_id` VARBINARY( 40 ) NOT NULL ,
`elapsed` float UNSIGNED NOT NULL ,
`page_id` VARCHAR( 255 ) NOT NULL ,
`time` INT( 10 ) UNSIGNED NOT NULL ,
`ip` VARBINARY( 40 ) NOT NULL ,
`agent` VARCHAR( 255 ) NOT NULL ,
PRIMARY KEY ( `user_id` , `page_id` , `time` , `ip`, `agent` )
) ENGINE MyISAM
Our production table has ~40K rows thus far!
Comments (4)
Note: I am writing this after reading the solutions by @astander, @Donnie, and @longneck.
I understand that performance is important, but why don't you store aggregates? Ten years at one row per day is about 3,650 rows with only a few columns each.
This way you would run the query only once at the end of the day, for one day only. Pre-calculated aggregates are at the root of any high-performance analytic solution (cubes).
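A rough sketch of how storing the aggregates could look. The table perflog_daily and its columns view_day, d30, d60 and d90 are hypothetical names, and the windows are assumed to be inclusive (the day itself plus the 29, 59 or 89 days before it). The statement is meant to run once per day, for example from cron, and only touches the last 90 days of perflog.

```sql
-- Hypothetical summary table: one row per calendar day.
CREATE TABLE perflog_daily (
    view_day DATE         NOT NULL PRIMARY KEY,
    d30      INT UNSIGNED NOT NULL,
    d60      INT UNSIGNED NOT NULL,
    d90      INT UNSIGNED NOT NULL
);

-- Store (or refresh) the figures for yesterday.
-- r is a one-row derived table holding the day being aggregated.
REPLACE INTO perflog_daily (view_day, d30, d60, d90)
SELECT
    r.view_day,
    -- CASE yields NULL outside the window, and COUNT(DISTINCT ...) skips NULLs
    COUNT(DISTINCT CASE WHEN l.time >= UNIX_TIMESTAMP(r.view_day - INTERVAL 29 DAY)
                        THEN l.page_id END),
    COUNT(DISTINCT CASE WHEN l.time >= UNIX_TIMESTAMP(r.view_day - INTERVAL 59 DAY)
                        THEN l.page_id END),
    COUNT(DISTINCT l.page_id)
FROM (SELECT CURDATE() - INTERVAL 1 DAY AS view_day) AS r
JOIN perflog AS l
  ON l.time >= UNIX_TIMESTAMP(r.view_day - INTERVAL 89 DAY)
 AND l.time <  UNIX_TIMESTAMP(r.view_day + INTERVAL 1 DAY)
GROUP BY r.view_day;
```

Reports then read from perflog_daily directly. Days with no rows in the 90-day window produce no summary row in this sketch.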
UPDATE:
You could speed up those queries by introducing another column
DayInEpoch int
(day number since, say, 1990-01-01). Then you can remove all those date/time conversion functions.
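For illustration, one way the extra column might be added and used. The column name day_in_epoch, the index name, and the sample date are assumptions; the day number is derived with TO_DAYS, which follows the session time zone through FROM_UNIXTIME.

```sql
-- Add the integer day column and an index on it.
ALTER TABLE perflog
    ADD COLUMN day_in_epoch INT NOT NULL DEFAULT 0,
    ADD INDEX idx_day_in_epoch (day_in_epoch);

-- Backfill existing rows; new rows would need the same expression at insert time.
UPDATE perflog
   SET day_in_epoch = TO_DAYS(FROM_UNIXTIME(time)) - TO_DAYS('1990-01-01');

-- Unique pages in the 30 days ending on a sample day, using integer arithmetic only.
SET @day = TO_DAYS('2009-11-10') - TO_DAYS('1990-01-01');

SELECT COUNT(DISTINCT page_id) AS d30
  FROM perflog
 WHERE day_in_epoch BETWEEN @day - 29 AND @day;
```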
Why do you have the subquery buried in a second level like that? Try this instead:
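A minimal sketch of the idea, not necessarily the exact query originally posted here. It assumes inclusive 30/60/90-day windows per calendar day, and the aliases view_day, d30, d60 and d90 are made up for illustration. The scalar subqueries sit directly in the SELECT list, where MySQL does allow references to the outer row. The derived table d holds only the distinct days, so each subquery runs once per day rather than once per log row.

```sql
SELECT
    d.view_day,
    (SELECT COUNT(DISTINCT s.page_id)
       FROM perflog AS s
      WHERE DATE(FROM_UNIXTIME(s.time))
            BETWEEN d.view_day - INTERVAL 29 DAY AND d.view_day) AS d30,
    (SELECT COUNT(DISTINCT s.page_id)
       FROM perflog AS s
      WHERE DATE(FROM_UNIXTIME(s.time))
            BETWEEN d.view_day - INTERVAL 59 DAY AND d.view_day) AS d60,
    (SELECT COUNT(DISTINCT s.page_id)
       FROM perflog AS s
      WHERE DATE(FROM_UNIXTIME(s.time))
            BETWEEN d.view_day - INTERVAL 89 DAY AND d.view_day) AS d90
FROM (SELECT DISTINCT DATE(FROM_UNIXTIME(time)) AS view_day
        FROM perflog) AS d
ORDER BY d.view_day;
```

The ORDER BY uses an unquoted column name on purpose: in MySQL, ORDER BY 'DATE' with single quotes sorts by a string constant, not by the alias.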
You can try using a single select.
Select the values between the date and 90 days prior only.
Then use a CASE expression in each field to check whether the date falls within the last 30, 60, or 90 days. For each field, if the case is true, then 1 else 0, and count those.
Something like
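One possible shape for this, as a sketch under a few assumptions: windows are inclusive, the aliases are invented for illustration, and the CASE returns page_id instead of 1/0 so that COUNT(DISTINCT ...) yields unique pages rather than total views. Each day is joined once to the (day, page) pairs of its 90-day window.

```sql
SELECT
    d.view_day,
    -- CASE yields NULL outside the window, and COUNT(DISTINCT ...) skips NULLs
    COUNT(DISTINCT CASE WHEN l.view_day >= d.view_day - INTERVAL 29 DAY
                        THEN l.page_id END) AS d30,
    COUNT(DISTINCT CASE WHEN l.view_day >= d.view_day - INTERVAL 59 DAY
                        THEN l.page_id END) AS d60,
    COUNT(DISTINCT l.page_id)               AS d90
FROM (SELECT DISTINCT DATE(FROM_UNIXTIME(time)) AS view_day
        FROM perflog) AS d
JOIN (SELECT DISTINCT DATE(FROM_UNIXTIME(time)) AS view_day, page_id
        FROM perflog) AS l
  ON l.view_day BETWEEN d.view_day - INTERVAL 89 DAY AND d.view_day
GROUP BY d.view_day
ORDER BY d.view_day;
```

Deduplicating to (day, page) pairs before the join keeps the intermediate result much smaller than joining the raw log to itself.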
Change the subselects into joins, as such:
However, that's likely to run slowly because the pile of functions defeats any indexes on your date column. A better solution might be:
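A hedged sketch of an index-friendly variant, which may differ from the query originally posted here. The index name idx_time is an assumption, as is the choice of inclusive windows. The join compares the raw time column against bounds computed from each day, so no function is applied to the indexed column.

```sql
-- The existing primary key starts with user_id, so add an index that leads with time.
ALTER TABLE perflog ADD INDEX idx_time (time);

SELECT
    d.view_day,
    COUNT(DISTINCT CASE WHEN l.time >= UNIX_TIMESTAMP(d.view_day - INTERVAL 29 DAY)
                        THEN l.page_id END) AS d30,
    COUNT(DISTINCT CASE WHEN l.time >= UNIX_TIMESTAMP(d.view_day - INTERVAL 59 DAY)
                        THEN l.page_id END) AS d60,
    COUNT(DISTINCT l.page_id)               AS d90
FROM (SELECT DISTINCT DATE(FROM_UNIXTIME(time)) AS view_day
        FROM perflog) AS d
JOIN perflog AS l
  -- half-open range: from midnight 89 days earlier up to, not including, the next midnight
  ON l.time >= UNIX_TIMESTAMP(d.view_day - INTERVAL 89 DAY)
 AND l.time <  UNIX_TIMESTAMP(d.view_day + INTERVAL 1 DAY)
GROUP BY d.view_day
ORDER BY d.view_day;
```

FROM_UNIXTIME and UNIX_TIMESTAMP both follow the session time zone, so day boundaries stay consistent as long as the query runs under one time zone setting. Whether the optimizer actually uses idx_time for the range join is worth confirming with EXPLAIN on the real data.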