Nested subquery too slow - is an outer join equivalent?
I'm collecting some basic statistics on our codebase and am trying to write a query against the following schema:
- A files table holding all the files (synthetic primary key id, unique path, and a region column recording which region the file belongs to).
- A file_stats table holding data for the files on a specific date (Primary Key is combination of date and file_id)
CREATE TABLE files (
    id INT PRIMARY KEY,
    path VARCHAR(255) NOT NULL UNIQUE,
    region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
);
CREATE TABLE file_stats (
    date DATE NOT NULL,
    file_id INT NOT NULL REFERENCES files,
    num_lines INT NOT NULL,
    CONSTRAINT file_stats__pk PRIMARY KEY(date, file_id)
);
I'm trying to create a query which will return all combinations of dates and regions in the tables and the number of files for that combination.
The simple approach of
SELECT date, region, COUNT(*) FROM file_stats fs, files f WHERE fs.file_id = f.id
GROUP BY date, region
doesn't work, as not all regions are represented on all dates.
I've tried
SELECT
    d.date,
    r.region,
    (SELECT COUNT(*) FROM file_stats fs, files f
     WHERE fs.file_id = f.id AND fs.date = d.date AND f.region = r.region
    ) AS num_files
FROM
    (SELECT DISTINCT date FROM file_stats) AS d,
    (SELECT DISTINCT region FROM files) AS r
but the performance is unacceptable because of the nested subquery.
I've tried LEFT OUTER JOINs, but never seem to be able to make them work.
The database is SQLite.
Can anyone suggest a better query?
Comments (3)
Assuming you mean it works correctly, but you need all the dates to show whether a region might appear there or not, then you need two things.
After you have a calendar table, something like this . . .
I used cal_date above. The name you use depends on your calendar table. This will get you started. You can use a spreadsheet to generate the dates.
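As a rough sketch of the calendar-table approach (not necessarily the exact query meant above): cross join the calendar to the distinct regions, then left join the stats, so a date or region with no files still gets a row with a count of 0. The table name calendar, the sample rows, the trimmed schema and the Python sqlite3 wrapper are all assumptions, there only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed copy of the question's schema (constraints simplified).
CREATE TABLE files (id INT PRIMARY KEY, path TEXT NOT NULL UNIQUE, region TEXT);
CREATE TABLE file_stats (date DATE NOT NULL, file_id INT NOT NULL,
                         num_lines INT NOT NULL, PRIMARY KEY (date, file_id));
-- Hypothetical calendar table: one row per date of interest.
CREATE TABLE calendar (cal_date DATE PRIMARY KEY);

-- Made-up sample rows.
INSERT INTO files VALUES (1, 'a.py', 'NYK'), (2, 'b.py', 'NYK'), (3, 'c.py', 'LDN');
INSERT INTO file_stats VALUES
    ('2011-01-01', 1, 10), ('2011-01-01', 2, 20), ('2011-01-02', 3, 30);
INSERT INTO calendar VALUES ('2011-01-01'), ('2011-01-02'), ('2011-01-03');
""")

rows = conn.execute("""
SELECT c.cal_date, r.region, COUNT(fs.file_id) AS num_files
FROM calendar AS c
CROSS JOIN (SELECT DISTINCT region FROM files) AS r
LEFT JOIN files AS f ON f.region = r.region
LEFT JOIN file_stats AS fs ON fs.file_id = f.id AND fs.date = c.cal_date
GROUP BY c.cal_date, r.region
ORDER BY c.cal_date, r.region
""").fetchall()

for row in rows:
    print(row)
```

COUNT(fs.file_id) rather than COUNT(*) is what makes the empty combinations come out as 0, because the left join leaves fs.file_id NULL when nothing matches.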
If you're certain that all the dates are in file_stats, you can do without a calendar table. But there are some cautions.
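A minimal sketch of doing without the calendar table, assuming the dates can be taken from file_stats itself. The sample rows are made up and the Python sqlite3 wrapper is only there to make it runnable. Note how 2011-01-02 never appears in the output, because no row in file_stats carries that date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed copy of the question's schema (constraints simplified).
CREATE TABLE files (id INT PRIMARY KEY, path TEXT NOT NULL UNIQUE, region TEXT);
CREATE TABLE file_stats (date DATE NOT NULL, file_id INT NOT NULL,
                         num_lines INT NOT NULL, PRIMARY KEY (date, file_id));

-- Made-up sample rows; nothing was recorded on 2011-01-02.
INSERT INTO files VALUES (1, 'a.py', 'NYK'), (2, 'b.py', 'LDN');
INSERT INTO file_stats VALUES ('2011-01-01', 1, 10), ('2011-01-03', 2, 20);
""")

rows = conn.execute("""
SELECT d.date, r.region, COUNT(fs.file_id) AS num_files
FROM (SELECT DISTINCT date FROM file_stats) AS d
CROSS JOIN (SELECT DISTINCT region FROM files) AS r
LEFT JOIN files AS f ON f.region = r.region
LEFT JOIN file_stats AS fs ON fs.file_id = f.id AND fs.date = d.date
GROUP BY d.date, r.region
ORDER BY d.date, r.region
""").fetchall()

for row in rows:
    print(row)
```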
This will work if your data is right, but your tables don't guarantee the data will be right. file_stats.file_id is declared as REFERENCES files, but SQLite does not enforce foreign keys by default (it only does so once PRAGMA foreign_keys = ON is set), so there might be file id numbers in each table that don't have matching id numbers in the other table. Let's have some sample data.
Running this query (same as immediately above, but add ORDER BY) . . .
. . . returns
'LDN' doesn't show, because there's no row in file_stats with file id number 3. One row has a null region, because no row in files has file id number 4.
You can quickly find mismatched rows with a left join.
returns
meaning that there's a row in files that has id 3, but no row in file_stats that has id 3. Flip the table around to determine the rows in file_stats that have no matching row in files.
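A small sketch of both mismatch checks, using made-up rows that follow the description above (file id 3 exists only in files, file id 4 only in file_stats). The schema is trimmed and the Python sqlite3 wrapper is just a convenient way to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed schema with no foreign key, so mismatched ids are possible.
CREATE TABLE files (id INT PRIMARY KEY, path TEXT NOT NULL UNIQUE, region TEXT);
CREATE TABLE file_stats (date DATE NOT NULL, file_id INT NOT NULL,
                         num_lines INT NOT NULL, PRIMARY KEY (date, file_id));

-- Made-up sample rows.
INSERT INTO files VALUES (1, 'a.py', 'NYK'), (2, 'b.py', 'TKY'), (3, 'c.py', 'LDN');
INSERT INTO file_stats VALUES
    ('2011-01-01', 1, 10), ('2011-01-01', 2, 20), ('2011-01-01', 4, 40);
""")

# Rows in files that have no matching row in file_stats.
files_without_stats = conn.execute("""
SELECT f.id
FROM files AS f
LEFT JOIN file_stats AS fs ON fs.file_id = f.id
WHERE fs.file_id IS NULL
""").fetchall()

# Flipped around: rows in file_stats that have no matching row in files.
stats_without_files = conn.execute("""
SELECT fs.date, fs.file_id
FROM file_stats AS fs
LEFT JOIN files AS f ON f.id = fs.file_id
WHERE f.id IS NULL
""").fetchall()

print(files_without_stats)
print(stats_without_files)
```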
One way of doing what you want (slower, because of the performance hit of the second half) is a UNION of the things that have a count with a manufactured list of the things that have a zero count:
I'm not entirely sure why you're opposed to the use of temp tables? E.g. (this is Sybase-ish syntax for populating the temp table, but it should port easily; I don't recall the exact SQLite syntax). The table should be small (just number of days * number of regions rows).
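A sketch of what such a UNION could look like in SQLite, as an assumption about the shape rather than the exact query intended here. The first half counts the combinations that exist, and the second half manufactures a 0 row for every date and region pair with no files. Sample rows are made up and the Python sqlite3 wrapper is only there to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed copy of the question's schema (constraints simplified).
CREATE TABLE files (id INT PRIMARY KEY, path TEXT NOT NULL UNIQUE, region TEXT);
CREATE TABLE file_stats (date DATE NOT NULL, file_id INT NOT NULL,
                         num_lines INT NOT NULL, PRIMARY KEY (date, file_id));

-- Made-up sample rows.
INSERT INTO files VALUES (1, 'a.py', 'NYK'), (2, 'b.py', 'LDN');
INSERT INTO file_stats VALUES
    ('2011-01-01', 1, 10), ('2011-01-02', 1, 11), ('2011-01-02', 2, 20);
""")

rows = conn.execute("""
-- First half: combinations that actually have files.
SELECT fs.date, f.region, COUNT(*) AS num_files
FROM file_stats AS fs
JOIN files AS f ON f.id = fs.file_id
GROUP BY fs.date, f.region
UNION ALL
-- Second half: manufactured zero rows for the missing combinations.
SELECT d.date, r.region, 0
FROM (SELECT DISTINCT date FROM file_stats) AS d
CROSS JOIN (SELECT DISTINCT region FROM files) AS r
WHERE NOT EXISTS (
    SELECT 1
    FROM file_stats AS fs2
    JOIN files AS f2 ON f2.id = fs2.file_id
    WHERE fs2.date = d.date AND f2.region = r.region
)
ORDER BY 1, 2
""").fetchall()

for row in rows:
    print(row)
```

UNION ALL is enough here because the two halves cannot produce the same date and region pair.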
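A rough SQLite port of the temp-table idea, under my own assumptions about its shape: one temp table holds every date and region combination, another holds the real counts, and a left join fills in the zeros. The temp table names (all_combos, counts) and the sample rows are hypothetical, and the Python sqlite3 wrapper is just a way to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed copy of the question's schema (constraints simplified).
CREATE TABLE files (id INT PRIMARY KEY, path TEXT NOT NULL UNIQUE, region TEXT);
CREATE TABLE file_stats (date DATE NOT NULL, file_id INT NOT NULL,
                         num_lines INT NOT NULL, PRIMARY KEY (date, file_id));

-- Made-up sample rows.
INSERT INTO files VALUES (1, 'a.py', 'NYK'), (2, 'b.py', 'NYK'), (3, 'c.py', 'TKY');
INSERT INTO file_stats VALUES
    ('2011-01-01', 1, 10), ('2011-01-01', 2, 20), ('2011-01-02', 3, 30);

-- Every (date, region) pair: number of days * number of regions rows.
CREATE TEMP TABLE all_combos AS
SELECT d.date AS date, r.region AS region
FROM (SELECT DISTINCT date FROM file_stats) AS d
CROSS JOIN (SELECT DISTINCT region FROM files) AS r;

-- Counts for the pairs that actually occur, computed in a single pass.
CREATE TEMP TABLE counts AS
SELECT fs.date AS date, f.region AS region, COUNT(*) AS num_files
FROM file_stats AS fs
JOIN files AS f ON f.id = fs.file_id
GROUP BY fs.date, f.region;
""")

combo_count = conn.execute("SELECT COUNT(*) FROM all_combos").fetchone()[0]

rows = conn.execute("""
SELECT a.date, a.region, COALESCE(c.num_files, 0) AS num_files
FROM all_combos AS a
LEFT JOIN counts AS c ON c.date = a.date AND c.region = a.region
ORDER BY a.date, a.region
""").fetchall()

print(combo_count)
for row in rows:
    print(row)
```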
I suspect that the nested subquery has to scan file_stats and files for every single row of the output. The following version might be substantially faster, and it won't require creating new tables.
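One way such a version could look, as a sketch under my own assumptions rather than a measured fix: aggregate the counts once in a derived table, then left join that onto the cross product of distinct dates and regions, so the join and GROUP BY run once instead of once per output row. Sample rows are made up, and the Python sqlite3 wrapper is only there to run it. Whether it is actually faster on real data would need checking, for example with EXPLAIN QUERY PLAN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed copy of the question's schema (constraints simplified).
CREATE TABLE files (id INT PRIMARY KEY, path TEXT NOT NULL UNIQUE, region TEXT);
CREATE TABLE file_stats (date DATE NOT NULL, file_id INT NOT NULL,
                         num_lines INT NOT NULL, PRIMARY KEY (date, file_id));

-- Made-up sample rows.
INSERT INTO files VALUES
    (1, 'a.py', 'NYK'), (2, 'b.py', 'NYK'), (3, 'c.py', 'LDN'), (4, 'd.py', 'TKY');
INSERT INTO file_stats VALUES
    ('2011-01-01', 1, 10), ('2011-01-01', 2, 20), ('2011-01-01', 3, 30),
    ('2011-01-02', 1, 11), ('2011-01-02', 4, 5);
""")

rows = conn.execute("""
SELECT d.date, r.region, COALESCE(c.num_files, 0) AS num_files
FROM (SELECT DISTINCT date FROM file_stats) AS d
CROSS JOIN (SELECT DISTINCT region FROM files) AS r
LEFT JOIN (
    -- Counts per (date, region), computed once.
    SELECT fs.date AS date, f.region AS region, COUNT(*) AS num_files
    FROM file_stats AS fs
    JOIN files AS f ON f.id = fs.file_id
    GROUP BY fs.date, f.region
) AS c ON c.date = d.date AND c.region = r.region
ORDER BY d.date, r.region
""").fetchall()

for row in rows:
    print(row)
```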