Nested subquery too slow - is there an outer join equivalent?

Posted on 2024-10-17 12:04:47

I'm collecting some basic statistics on our codebase and am trying to generate a query using the following schema data

A files table holding all the files (synthetic primary key id, unique path, and a region column recording which region the file belongs to).
A file_stats table holding data for the files on a specific date (the primary key is the combination of date and file_id).

CREATE TABLE files (
id INT PRIMARY KEY,
path VARCHAR(255) NOT NULL UNIQUE,
region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
);

CREATE TABLE file_stats (
date DATE NOT NULL,
file_id INT NOT NULL REFERENCES files,
num_lines INT NOT NULL,

CONSTRAINT file_stats__pk PRIMARY KEY(date, file_id)
);

I'm trying to create a query which will return all combinations of dates and regions in the tables and the number of files for that combination.

The simple approach of

SELECT date, region, COUNT(*) FROM file_stats fs, files f WHERE fs.file_id = f.id
GROUP BY date, region

doesn't work as not all regions are represented at all dates.
I've tried

SELECT
d.date,
r.region,
(SELECT COUNT(*) FROM file_stats fs, files f
WHERE fs.file_id = f.id AND fs.date = d.date AND f.region = r.region
) AS num_files
FROM
(SELECT DISTINCT date FROM file_stats) AS d,
(SELECT DISTINCT region FROM files) AS r

but the performance is unacceptable because of the nested subquery.

I've tried LEFT OUTER JOINS, but never seem to be able to make them work.
The database is SQLite.

Can anyone suggest a better query?


云之铃。 2024-10-24 12:04:47

SELECT date, region, COUNT(*) FROM file_stats fs, files f WHERE fs.file_id = f.id
GROUP BY date, region

doesn't work as not all regions are represented at all dates.

Assuming you mean it works correctly, but you need all the dates to show whether a region might appear there or not, then you need two things.

A calendar table.
A left join on the calendar table.

After you have a calendar table, something like this . . .

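SELECT c.cal_date, f.region, COUNT(fs.file_id) AS num_files
FROM calendar c
LEFT JOIN file_stats fs ON (fs.date = c.cal_date)
-- LEFT JOIN here too: an INNER JOIN would discard calendar dates that have no file_stats rows
LEFT JOIN files f ON (f.id = fs.file_id)
-- COUNT(fs.file_id) skips the NULLs produced by the outer join, so empty dates count as 0
GROUP BY c.cal_date, f.region;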

I used cal_date above. The name you use depends on your calendar table. This will get you started. You can use a spreadsheet to generate the dates.

CREATE TABLE calendar (cal_date date primary key);
INSERT INTO "calendar" VALUES('2011-01-01');
INSERT INTO "calendar" VALUES('2011-01-02');
INSERT INTO "calendar" VALUES('2011-01-03');
INSERT INTO "calendar" VALUES('2011-01-04');
INSERT INTO "calendar" VALUES('2011-01-05');
INSERT INTO "calendar" VALUES('2011-01-06');
INSERT INTO "calendar" VALUES('2011-01-07');
INSERT INTO "calendar" VALUES('2011-01-08');
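
If a spreadsheet is inconvenient, the calendar rows can also be generated by a short script. The following is a minimal sketch using Python's sqlite3 module against an in-memory database; the helper name fill_calendar and the date range are assumptions chosen to match the INSERT statements above, not part of the original answer.

```python
import sqlite3
from datetime import date, timedelta


def fill_calendar(conn, start, end):
    """Insert one row per day from start to end (inclusive) into calendar.

    fill_calendar is a hypothetical helper written for this example.
    """
    days = (end - start).days + 1
    # Store dates as ISO-8601 text, the same format used in the answer.
    rows = [((start + timedelta(days=i)).isoformat(),) for i in range(days)]
    conn.executemany("INSERT INTO calendar (cal_date) VALUES (?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendar (cal_date DATE PRIMARY KEY)")

inserted = fill_calendar(conn, date(2011, 1, 1), date(2011, 1, 8))
calendar_rows = [
    row[0] for row in conn.execute("SELECT cal_date FROM calendar ORDER BY cal_date")
]
```

Because the dates are stored as ISO-8601 strings, they sort chronologically and compare equal to the date values used in file_stats, provided those are stored in the same format.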

If you're certain that all the dates are in file_stats, you can do without a calendar table. But there are some cautions.

select fs.date, f.region, count(*)
from file_stats fs
left join files f on (f.id = fs.file_id)
group by fs.date, f.region;

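This will work if your data is right, but your tables don't guarantee the data will be right. file_stats declares REFERENCES files, but SQLite enforces foreign keys only when PRAGMA foreign_keys = ON has been set on the connection (it is off by default), and no constraint requires every file to have a file_stats row. So there might be file id numbers in each table that don't have matching id numbers in the other table. Let's have some sample data.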

insert into files values (1, 'a long path', 'NYK');
insert into files values (2, 'another long path', 'NYK');
insert into files values (3, 'a shorter long path', 'LDN'); -- not in file_stats

insert into file_stats values ('2011-01-01', 1, 35);
insert into file_stats values ('2011-01-02', 1, 37);
insert into file_stats values ('2011-01-01', 2, 40);
insert into file_stats values ('2011-01-01', 4, 35); -- not in files

Running this query (same as immediately above, but add ORDER BY) . . .

select fs.date, f.region, count(*)
from file_stats fs
left join files f on (f.id = fs.file_id)
group by fs.date, f.region
order by fs.date, f.region;

. . . returns

2011-01-01||1
2011-01-01|NYK|2
2011-01-02|NYK|1

'LDN' doesn't show, because there's no row in file_stats with file id number 3. One row has a null region, because no row in files has file id number 4.

You can quickly find mismatched rows with a left join.

select f.id, fs.file_id 
from files f
left join file_stats fs on (fs.file_id = f.id)
where fs.file_id is null;

returns

3|

meaning that there's a row in files that has id 3, but no row in file_stats that has file_id 3. Flip the tables around to find the rows in file_stats that have no matching row in files.

select fs.file_id, f.id
from file_stats fs 
left join files f  on (fs.file_id = f.id)
where f.id is null;
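
Rows in file_stats that point at a missing file can also be prevented up front. The file_stats table in the question declares REFERENCES files, but SQLite only enforces that when foreign key support is switched on for the connection. A minimal sketch, assuming Python's sqlite3 module and an in-memory database; the variable orphan_rejected is just a name used for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign key enforcement is per connection and off by default in SQLite.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE files (
        id INT PRIMARY KEY,
        path VARCHAR(255) NOT NULL UNIQUE,
        region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
    )
""")
conn.execute("""
    CREATE TABLE file_stats (
        date DATE NOT NULL,
        file_id INT NOT NULL REFERENCES files,
        num_lines INT NOT NULL,
        CONSTRAINT file_stats__pk PRIMARY KEY (date, file_id)
    )
""")

conn.execute("INSERT INTO files VALUES (1, 'a long path', 'NYK')")
conn.execute("INSERT INTO file_stats VALUES ('2011-01-01', 1, 35)")

# file id 4 does not exist in files, so this insert should now be refused.
try:
    conn.execute("INSERT INTO file_stats VALUES ('2011-01-01', 4, 35)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

stats_count = conn.execute("SELECT COUNT(*) FROM file_stats").fetchone()[0]
```

This only covers one direction: a file with no file_stats rows (such as id 3 above) is still perfectly legal, so the first mismatch query remains useful.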


仄言 2024-10-24 12:04:47

One way of doing what you want (slower, because the second half of the query is expensive) is a UNION of the pairs that have a count with a manufactured list of the pairs that have a zero count:

-- Include the counts for date/region pairs that HAVE files
SELECT date, region, COUNT(*) as COUNT1
FROM file_stats fs, files f 
WHERE fs.file_id = f.id
GROUP BY date, region

UNION

SELECT DISTINCT date, region, 0 as COUNT1
FROM file_stats fs0, files f0
WHERE NOT EXISTS (
    SELECT 1
    FROM   file_stats fs, files f 
    WHERE  fs.file_id = f.id
    AND    fs.date=fs0.date
    AND    f.region=f0.region
)

I'm not entirely sure why you're opposed to the use of temp tables? For example (temp table syntax differs between databases; the statements below use SQLite's form). The table should be small (just number of days * number of regions rows).

CREATE TEMPORARY TABLE combinations (region VARCHAR(4), date DATE);

-- Name the target columns: the SELECT yields (date, region), the table is declared (region, date)
INSERT INTO combinations (date, region)
SELECT DISTINCT fs.date, f.region FROM files f, file_stats fs;

-- fs.file_id (file_stats has no id column) is NULL when a file has no row for that date
SELECT c.date, c.region, SUM(CASE WHEN fs.file_id IS NULL THEN 0 ELSE 1 END) AS num_files
FROM combinations c
LEFT JOIN files f ON f.region = c.region
LEFT OUTER JOIN file_stats fs ON fs.date = c.date AND fs.file_id = f.id
GROUP BY c.date, c.region;
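
To check that the UNION query and the temp table approach agree, both can be run against a small data set. The sketch below assumes Python's sqlite3 module, an in-memory database, and a handful of sample rows invented for illustration (two NYK files and one LDN file with no statistics); the temp table statements are written in SQLite syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id INT PRIMARY KEY,
    path VARCHAR(255) NOT NULL UNIQUE,
    region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
);
CREATE TABLE file_stats (
    date DATE NOT NULL,
    file_id INT NOT NULL REFERENCES files,
    num_lines INT NOT NULL,
    CONSTRAINT file_stats__pk PRIMARY KEY (date, file_id)
);
-- Sample rows (assumed for this example).
INSERT INTO files VALUES (1, 'a long path', 'NYK');
INSERT INTO files VALUES (2, 'another long path', 'NYK');
INSERT INTO files VALUES (3, 'a shorter long path', 'LDN');
INSERT INTO file_stats VALUES ('2011-01-01', 1, 35);
INSERT INTO file_stats VALUES ('2011-01-02', 1, 37);
INSERT INTO file_stats VALUES ('2011-01-01', 2, 40);

-- One row per (date, region) pair, whether or not any file matches.
CREATE TEMPORARY TABLE combinations (region VARCHAR(4), date DATE);
INSERT INTO combinations (date, region)
    SELECT DISTINCT fs.date, f.region FROM files f, file_stats fs;
""")

temp_table_rows = conn.execute("""
    SELECT c.date, c.region,
           SUM(CASE WHEN fs.file_id IS NULL THEN 0 ELSE 1 END) AS num_files
    FROM combinations c
    LEFT JOIN files f ON f.region = c.region
    LEFT JOIN file_stats fs ON fs.date = c.date AND fs.file_id = f.id
    GROUP BY c.date, c.region
    ORDER BY c.date, c.region
""").fetchall()

union_rows = conn.execute("""
    SELECT date, region, COUNT(*) AS count1
    FROM file_stats fs, files f
    WHERE fs.file_id = f.id
    GROUP BY date, region
    UNION
    SELECT DISTINCT date, region, 0 AS count1
    FROM file_stats fs0, files f0
    WHERE NOT EXISTS (
        SELECT 1 FROM file_stats fs, files f
        WHERE fs.file_id = f.id AND fs.date = fs0.date AND f.region = f0.region
    )
    ORDER BY 1, 2
""").fetchall()
```

On this data both queries return the same four rows, with a count of 0 for LDN on each date. That says nothing about their relative speed on a real code base, which would need to be measured.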


贪了杯 2024-10-24 12:04:47

I suspect that it has to scan file_stats and files for every single row of the output. The following version might be substantially faster, and it doesn't require creating new tables.

SELECT d.date
  , r.region
  , count(f.id) AS num_files  -- counts only stats rows whose file is in region r
FROM (SELECT DISTINCT date FROM file_stats) AS d
  CROSS JOIN (SELECT DISTINCT region FROM files) AS r
  LEFT JOIN file_stats AS fs
    ON fs.date = d.date
  LEFT JOIN files f
    ON f.id = fs.file_id
      AND f.region = r.region
GROUP BY d.date, r.region;
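
A quick way to gain confidence in a rewrite like this is to run it next to the correlated subquery from the question and compare the results. The sketch below assumes Python's sqlite3 module, an in-memory database, and a few sample rows invented for illustration; the join query is written with an explicit CROSS JOIN and joins files on f.id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id INT PRIMARY KEY,
    path VARCHAR(255) NOT NULL UNIQUE,
    region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
);
CREATE TABLE file_stats (
    date DATE NOT NULL,
    file_id INT NOT NULL REFERENCES files,
    num_lines INT NOT NULL,
    CONSTRAINT file_stats__pk PRIMARY KEY (date, file_id)
);
-- Sample rows (assumed for this example).
INSERT INTO files VALUES (1, 'a long path', 'NYK');
INSERT INTO files VALUES (2, 'another long path', 'NYK');
INSERT INTO files VALUES (3, 'a shorter long path', 'LDN');
INSERT INTO file_stats VALUES ('2011-01-01', 1, 35);
INSERT INTO file_stats VALUES ('2011-01-02', 1, 37);
INSERT INTO file_stats VALUES ('2011-01-01', 2, 40);
""")

# The question's approach: one correlated subquery per (date, region) pair.
subquery_rows = conn.execute("""
    SELECT d.date, r.region,
           (SELECT COUNT(*) FROM file_stats fs, files f
            WHERE fs.file_id = f.id AND fs.date = d.date AND f.region = r.region
           ) AS num_files
    FROM (SELECT DISTINCT date FROM file_stats) AS d,
         (SELECT DISTINCT region FROM files) AS r
    ORDER BY d.date, r.region
""").fetchall()

# The join-based rewrite: build every pair, then outer join and count matches.
join_rows = conn.execute("""
    SELECT d.date, r.region, COUNT(f.id) AS num_files
    FROM (SELECT DISTINCT date FROM file_stats) AS d
    CROSS JOIN (SELECT DISTINCT region FROM files) AS r
    LEFT JOIN file_stats AS fs ON fs.date = d.date
    LEFT JOIN files AS f ON f.id = fs.file_id AND f.region = r.region
    GROUP BY d.date, r.region
    ORDER BY d.date, r.region
""").fetchall()
```

COUNT(f.id) is what makes the zero rows come out right: a file_stats row whose file belongs to another region joins to a NULL f.id, and COUNT ignores NULLs. Whether the join version is actually faster depends on the data and the available indexes, so it is worth timing both on the real tables.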
