在 MySQL 中转换 UNION 查询

发布于 2024-07-05 21:15:22 字数 1369 浏览 6 评论 0原文

我有一个非常大的表（8GB），其中包含有关文件的信息，我需要针对它运行一个报告，看起来像这样：

(select * from fs_walk_scan where file_path like '\\\\server1\\groot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server1\\hroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server1\\iroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server2\\froot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server2\\groot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server3\\hroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server4\\iroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server5\\iroot$\\%' order by file_size desc limit 0,30)
[...]
order by substring_index(file_path,'\\',4), file_size desc

此方法完成了我需要做的事情：获取每个文件的 30 个最大文件的列表体积。然而，这是极其缓慢的，并且“喜欢”的搜索是硬编码的，即使它们位于另一个表中并且可以通过这种方式获得。

我正在寻找一种方法来做到这一点，而无需多次浏览巨大的桌子。有人有主意吗？

谢谢。

PS我无法以任何方式改变巨大的源表的结构。

更新：file_path 和 file_size 上有索引，但是每个子（？）查询仍然需要大约 10 分钟，而且我必须至少执行 22 分钟。

原文

I have a very large table (8gb) with information about files, and i need to run a report against it that would would look something like this:

(select * from fs_walk_scan where file_path like '\\\\server1\\groot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server1\\hroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server1\\iroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server2\\froot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server2\\groot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server3\\hroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server4\\iroot$\\%' order by file_size desc limit 0,30)
UNION ALL
(select * from fs_walk_scan where file_path like '\\\\server5\\iroot$\\%' order by file_size desc limit 0,30)
[...]
order by substring_index(file_path,'\\',4), file_size desc

This method accomplishes what I need to do: Get a list of the 30 biggest files for each volume. However, this is deathly slow, and the 'like' searches are hardcoded even though they are sitting in another table and can be gotten that way.

What I'm looking for is a way to do this without going through the huge table several times. Anyone have any ideas?

Thanks.

P.S. I cant change the structure of the huge source table in any way.

Update: There are indexes on file_path and file_size, but each one of those sub(?)queries still takes about 10 mins, and I have to do 22 minimum.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

毅然前行 2024-07-12 21:15:22

您可以使用正则表达式：

select * from fs_walk_scan
  where file_path regexp '^\\\\server(1\\[ghi]|2\\[fg]|3\\h|[45]\\i)root$\\'

否则，如果您可以修改表结构，请添加两列来保存服务器名称和基本路径（并对它们建立索引），以便您可以创建更简单的查询：

select * from fs_walk_scan
  where server = 'server1' and base_path in ('groot
您可以设置一个触发器来初始化插入记录时的字段，或者随后进行批量更新以填充两个额外的列。
, 'hroot
您可以设置一个触发器来初始化插入记录时的字段，或者随后进行批量更新以填充两个额外的列。
, 'iroot
您可以设置一个触发器来初始化插入记录时的字段，或者随后进行批量更新以填充两个额外的列。
)
     or server = 'server2' and base_path in ('froot
您可以设置一个触发器来初始化插入记录时的字段，或者随后进行批量更新以填充两个额外的列。
, 'groot
您可以设置一个触发器来初始化插入记录时的字段，或者随后进行批量更新以填充两个额外的列。
)

您可以设置一个触发器来初始化插入记录时的字段，或者随后进行批量更新以填充两个额外的列。

You could use a regexp:

select * from fs_walk_scan
  where file_path regexp '^\\\\server(1\\[ghi]|2\\[fg]|3\\h|[45]\\i)root$\\'

Otherwise if you can modify your table structure, add two columns to hold the server name and base path (and index them), so that you can create a simpler query:

select * from fs_walk_scan
  where server = 'server1' and base_path in ('groot
You can either set up a trigger to initialise the fields when you insert the record, or else do a bulk update afterwards to fill in the two extra columns.
, 'hroot
You can either set up a trigger to initialise the fields when you insert the record, or else do a bulk update afterwards to fill in the two extra columns.
, 'iroot
You can either set up a trigger to initialise the fields when you insert the record, or else do a bulk update afterwards to fill in the two extra columns.
)
     or server = 'server2' and base_path in ('froot
You can either set up a trigger to initialise the fields when you insert the record, or else do a bulk update afterwards to fill in the two extra columns.
, 'groot
You can either set up a trigger to initialise the fields when you insert the record, or else do a bulk update afterwards to fill in the two extra columns.
)

You can either set up a trigger to initialise the fields when you insert the record, or else do a bulk update afterwards to fill in the two extra columns.

回复收藏 0 原文

无所谓啦 2024-07-12 21:15:22

该表上有什么类型的索引？此索引：

CREATE INDEX fs_search_idx ON fs_walk_scan(file_path, file_size desc)

将显着加快此查询...如果您还没有类似的索引。

更新：

您说 file_path 和 file_size 上已经有索引......它们是单独的索引吗？或者是否有一个索引将两列一起索引？对于这个查询来说，差异将是巨大的。即使有 22 个子查询，如果索引正确，速度也应该非常快。

回复收藏 0 原文

不甘平庸 2024-07-12 21:15:22

您可以使用分组和自加入来实现此目的。

SELECT substring_index(file_path, '\\', 4), file_path
from fs_walk_scan as ws1
WHERE 30<= (
select count(*) from fs_Walk_scan as ws2
where substring_index(ws2.file_path, '\\', 4) = substring_index(ws1.file_path, '\\', 4)
and ws2.file_size > ws1.file_size
and ws2.file_path <> ws1.file_path)
group by substring_index(file_path, '\\', 4)

它仍然是一个 O(n) 查询（n 是组数），但更灵活、更短。

编辑：
另一种方法是使用变量。您的目的的可行性将取决于您将如何运行此查询。

set @idx=0; set @cur_vol=0;                                                                      
SELECT file_volume, file_path, file_size FROM (
    SELECT file_volume, file_path, file_size,
    IF(@cur_vol != a.file_volume, @idx:=1, @idx:=@idx+1) AS row_index,
    IF(@cur_vol != a.file_volume, @cur_vol:=a.file_volume, 0) AS discard
    FROM (SELECT substring_index(file_path, '\\', 4) as file_volume, file_path, file_size 
        FROM fs_walk_scan
        ORDER BY substring_index(file_path,'\\',4), file_size DESC) AS a
    HAVING row_index <= 30) AS b;

我还没有尝试过这段代码，但是可以像这样使用变量的概念来达到您的目的。

You can use grouping and self join for achieving this.

SELECT substring_index(file_path, '\\', 4), file_path
from fs_walk_scan as ws1
WHERE 30<= (
select count(*) from fs_Walk_scan as ws2
where substring_index(ws2.file_path, '\\', 4) = substring_index(ws1.file_path, '\\', 4)
and ws2.file_size > ws1.file_size
and ws2.file_path <> ws1.file_path)
group by substring_index(file_path, '\\', 4)

It still is an O(n) query (n being number of groups) but is more flexible and shorter.

Edit:
Another approach is using variables. Feasibility for your purpose will depend on how you are going to run this query.

set @idx=0; set @cur_vol=0;                                                                      
SELECT file_volume, file_path, file_size FROM (
    SELECT file_volume, file_path, file_size,
    IF(@cur_vol != a.file_volume, @idx:=1, @idx:=@idx+1) AS row_index,
    IF(@cur_vol != a.file_volume, @cur_vol:=a.file_volume, 0) AS discard
    FROM (SELECT substring_index(file_path, '\\', 4) as file_volume, file_path, file_size 
        FROM fs_walk_scan
        ORDER BY substring_index(file_path,'\\',4), file_size DESC) AS a
    HAVING row_index <= 30) AS b;

I haven't tried this code yet, but the concept of variables can be used like this for your purpose.

回复收藏 0 原文

星光不落少年眉 2024-07-12 21:15:22

像这样的东西怎么样（尚未测试过，但看起来很接近）：

select * from fs_walk_scan where file_path like '\\\\server' and file_path like 'root$\\%' order by file_size desc

这样您就可以对单个字段进行一对比较，这些比较通常与您所描述的内容相匹配。也可以使用正则表达式，但我还没有这样做。

How about something like this (haven't tested it, but looks close):

select * from fs_walk_scan where file_path like '\\\\server' and file_path like 'root$\\%' order by file_size desc

This way you're doing a pair of comparisons on the individual field which will generically match what you've described. It may be possible to use a regex, too, but I've not done it.

回复收藏 0 原文

与往事干杯 2024-07-12 21:15:22

试试这个。
您希望获取文件大小较大且文件路径相同的记录少于 30 条的每条记录。

SELECT * 
FROM   fs_walk_scan a
WHERE  ( SELECT COUNT(*) 
         FROM   fs_walk_scan b 
         WHERE  b.file_size  > a.file_size 
         AND    b.file_path  = a.file_path
       ) < 30

编辑：

显然这表现得像狗一样。那么...这个循环语法怎么样？

SELECT DISTINCT file_path
INTO tmp1
FROM   fs_walk_scan a

DECLARE path VARCHAR(255);

SELECT MIN(file_path)
INTO   path
FROM   tmp1 

WHILE  path IS NOT NULL DO
    SELECT * 
    FROM   fs_walk_scan
    WHERE  file_path = path
    ORDER BY file_size DESC
    LIMIT 0,30

    SELECT MIN(file_path)
    INTO   path
    FROM   tmp1
    WHERE  file_path > path 
END WHILE

这里的想法是
1. 获取文件路径列表
2. 循环，对每个路径进行查询，这将获得 30 个最大的文件大小。

（我确实查阅了语法，但我对 MySQL 不太热衷，所以如果它不完全存在，请原谅。请随意编辑/评论）

Try this.
You want to get every record where there are fewer than 30 records with greater file size and the same file path.

SELECT * 
FROM   fs_walk_scan a
WHERE  ( SELECT COUNT(*) 
         FROM   fs_walk_scan b 
         WHERE  b.file_size  > a.file_size 
         AND    b.file_path  = a.file_path
       ) < 30

Edit:

Apparently this performs like a dog. So... How about this looping syntax?

SELECT DISTINCT file_path
INTO tmp1
FROM   fs_walk_scan a

DECLARE path VARCHAR(255);

SELECT MIN(file_path)
INTO   path
FROM   tmp1 

WHILE  path IS NOT NULL DO
    SELECT * 
    FROM   fs_walk_scan
    WHERE  file_path = path
    ORDER BY file_size DESC
    LIMIT 0,30

    SELECT MIN(file_path)
    INTO   path
    FROM   tmp1
    WHERE  file_path > path 
END WHILE

The idea here is to
1. get a list of the file paths
2. loop, doing a query for each path which will get the 30 largest file sizes.

(I did look up the syntax, but I'm not very hot on MySQL, so appologies if it's not quite there. Feel free to edit/comment)

回复收藏 0 原文

盛夏尉蓝 2024-07-12 21:15:22

你可以做这样的事情......假设 fs_list 有一个你的“LIKE”搜索列表：

DELIMITER $

DROP PROCEDURE IF EXISTS `test`.`proc_fs_search` $
CREATE PROCEDURE `test`.`proc_fs_search` ()
BEGIN

DECLARE cur_path VARCHAR(255);
DECLARE done INT DEFAULT 0;


DECLARE list_cursor CURSOR FOR select file_path from fs_list;

DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

SET @sql_query = '';

OPEN list_cursor;

REPEAT
  FETCH list_cursor INTO cur_path;

  IF NOT done THEN
    IF @sql_query <> '' THEN
      SET @sql_query = CONCAT(@sql_query, ' UNION ALL ');
    END IF;

    SET @sql_query = CONCAT(@sql_query, ' (select * from fs_walk_scan where file_path like ''', cur_path , ''' order by file_size desc limit 0,30)');
  END IF;

UNTIL done END REPEAT;

SET @sql_query = CONCAT(@sql_query, ' order by file_path, file_size desc');

PREPARE stmt FROM @sql_query;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

END $

DELIMITER ;

You could do something like this... assuming fs_list has a list of your "LIKE" searches:

DELIMITER $

DROP PROCEDURE IF EXISTS `test`.`proc_fs_search` $
CREATE PROCEDURE `test`.`proc_fs_search` ()
BEGIN

DECLARE cur_path VARCHAR(255);
DECLARE done INT DEFAULT 0;


DECLARE list_cursor CURSOR FOR select file_path from fs_list;

DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

SET @sql_query = '';

OPEN list_cursor;

REPEAT
  FETCH list_cursor INTO cur_path;

  IF NOT done THEN
    IF @sql_query <> '' THEN
      SET @sql_query = CONCAT(@sql_query, ' UNION ALL ');
    END IF;

    SET @sql_query = CONCAT(@sql_query, ' (select * from fs_walk_scan where file_path like ''', cur_path , ''' order by file_size desc limit 0,30)');
  END IF;

UNTIL done END REPEAT;

SET @sql_query = CONCAT(@sql_query, ' order by file_path, file_size desc');

PREPARE stmt FROM @sql_query;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

END $

DELIMITER ;

回复收藏 0 原文

~没有更多了~