针对单一记录的 Unnest 查询优化
我正在尝试优化我的查询,以防止内部客户只想返回一个结果*(及其关联的嵌套数据集)。我的目标是减少查询过程的大小。
然而,无论我查询的是 1 条记录(未嵌套的 48,000 长度数组)还是整个数据集(10,000 条记录,未嵌套的数组总长度为 514,048,748),它似乎都是完全相同的值!
因此,我的表对一条记录查询的结果
SELECT test_id, value
FROM <my_table_reference>, unnest(TimeSeries)timeseries
WHERE test_id= "T0003" and SignalName = "Distance";
如下所示:
test_id | value |
---|---|
T0003 | 1.0 |
T0003 | 2.0 |
T0003 | 3.0 |
T0003 | 4.0 |
(48000 rows)
这将一直持续到 1 条记录的 value (Distance) = 48000m (48000 rows): WHERE = ='T0003
。
总进程为3.84GB
对于整个表(约10,000条记录):
SELECT test_id, value
FROM <my_table_reference>, unnest(TimeSeries)timeseries
WHERE SignalName = "Distance";
如下所示:
test_id | value |
---|---|
T0001 | 1.0 |
T0001 | 2.0 |
T0001 | 3.0 |
T0001 | 4.0 |
(514,048,748行)
总进程为3.84GB >
为什么两个查询的进程大小相同以及如何优化它奇异行提取?
I'm trying to optimise my query for when an internal customer only want to return one result *(and it's associated nested dataset). My aim is to reduce the query process size.
However, it appears to be the exact same value regardless of whether I'm querying for 1 record (with unnested 48,000 length array) or the whole dataset (10,000 records with unnest total 514,048,748 in total length of arrays)!
So my table results for one record query:
SELECT test_id, value
FROM <my_table_reference>, unnest(TimeSeries)timeseries
WHERE test_id= "T0003" and SignalName = "Distance";
looks like this:
test_id | value |
---|---|
T0003 | 1.0 |
T0003 | 2.0 |
T0003 | 3.0 |
T0003 | 4.0 |
(48000 rows)
This will continue until value (Distance) = 48000m (48000 rows) for 1 record: WHERE == 'T0003
.
Total process was 3.84GB
For whole table (~10,000 records):
SELECT test_id, value
FROM <my_table_reference>, unnest(TimeSeries)timeseries
WHERE SignalName = "Distance";
looks like this:
test_id | value |
---|---|
T0001 | 1.0 |
T0001 | 2.0 |
T0001 | 3.0 |
T0001 | 4.0 |
(514,048,748 rows)
Total process was 3.84GB
Why are the process size the same for both queries and how can I optimise this for singular row extractions?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
发生这种情况是因为仍然需要进行全表扫描来查找与指定 ID 相等的所有测试 ID。
从您的示例中不清楚哪些列是
timeseries
记录的一部分。如果test_id
不是其中之一,我建议将表聚集在test_id
列上。通过聚类,数据会根据test_id
列的内容自动组织。因此,当您使用该列上的过滤器进行查询时,不需要完整扫描来查找所有值。
请此处详细了解集群表。
This is happening because there is still need for a full table scan to find all the test IDs that are equal to the specified one.
It is not clear from your example which columns are part of the
timeseries
record. In casetest_id
is not one of them, I would suggest to cluster the table on thetest_id
column. By clustering, the data will be automatically organized according to the contents of thetest_id
column.So, when you query with a filter on that column a full scan won't be needed to find all values.
Read more about clustered tables here.