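Select rows with a JSONB column that have a different value in one JSONB property and a matching value in another JSONB property
I have a table notifications which contains a payload column of type JSONB, with a GIN index on this column. The table currently contains 2,742,691 rows.
The table looks something like this: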
ID | payload | created_at |
---|---|---|
1 | {"customer": {"email": "[email protected]", …}} | … |
… | … | 2022-06-20 |
4 | {"customer": {"email": "[email protected]", "externalId": 444}} | 2022-04-14 |
5 | {"customer": {"email": "[email protected]", "externalId": 555}} | 2022-04-12 |
6 | {"customer": {"email": "[email protected]", …}} | 2022-06-11 |
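I am trying to query a list of email addresses that match the following conditions:
- multiple rows with the same email address exist
- one of those rows has a different externalId than one of the previous ones
- created_at is within the last month
For the example table contents, this should return only [email protected], because the other addresses either appear only once or have no rows created within the last month.
I am currently using a LEFT JOIN LATERAL like this: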
select
    n.payload -> 'customer' -> 'email'
from
    notifications n
    left join lateral (
        select
            n2.payload -> 'customer' ->> 'externalId' tid
        from
            notifications n2
        where
            n2.payload @> jsonb_build_object(
                'customer',
                jsonb_build_object('email', n.payload -> 'customer' -> 'email')
            )
            and not (n2.payload @> jsonb_build_object(
                'customer',
                jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
            ))
            and n2.created_at > NOW() - INTERVAL '1 month'
        limit
            1
    ) sub on true
where
    n.created_at > NOW() - INTERVAL '1 month'
    and sub.tid is not null;
However, this takes ages to run. The query plan for this looks like this: https://explain.depesz.com/s/mrib
QUERY PLAN
Nested Loop (cost=0.17..53386349.38 rows=259398 width=32)
-> Index Scan using index_notifications_created_at on notifications n (cost=0.09..51931.08 rows=259398 width=514)
Index Cond: (created_at > (now() - '1 mon'::interval))
-> Subquery Scan on sub (cost=0.09..205.60 rows=1 width=0)
Filter: (sub.tid IS NOT NULL)
-> Limit (cost=0.09..205.60 rows=1 width=32)
-> Index Scan using index_notifications_created_at on notifications n2 (cost=0.09..53228.33 rows=259 width=32)
Index Cond: (created_at > (now() - '1 mon'::interval))
Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
Functions: 13
Options: Inlining true, Optimization true, Expressions true, Deforming true
Any pointers on what I am doing wrong here / how to optimize this?
PostgreSQL's planner has no real insight into the internals of the JSON values, so it doesn't know how many rows from n2 are expected to have the same email as a given row from n. For that matter, it doesn't even know that this is the question being considered, as it doesn't understand how @> interacts with the inner workings of jsonb_build_object. So it just uses some very generic row estimates for the planning, and probably overestimates the number of rows substantially.
Your best bet is probably to pull the email and externalId out of the JSONB and put them into real columns. This will make it easier both to make better plans available, and make better info available so that the planner can choose those better plans.
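One way to pull the values into real columns without changing how rows are written is a pair of stored generated columns. This is only a sketch: it assumes PostgreSQL 12 or later, and the column names email and external_id are illustrative choices, not something taken from the question. Adding stored generated columns rewrites the table, so on a table of this size it is worth trying on a copy first.

```sql
-- Sketch only: requires PostgreSQL 12+ (generated columns).
-- The column names "email" and "external_id" are hypothetical.
ALTER TABLE notifications
    ADD COLUMN email text
        GENERATED ALWAYS AS (payload -> 'customer' ->> 'email') STORED,
    ADD COLUMN external_id text
        GENERATED ALWAYS AS (payload -> 'customer' ->> 'externalId') STORED;

-- Refresh statistics so the planner sees the new columns.
ANALYZE notifications;
```

With plain columns in place, the planner collects ordinary per-column statistics, which it cannot do for values buried inside the JSONB document.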
The better plan would be to use a composite index on (email, created_at), so it can jump directly to the part of the index that satisfies both the email equality condition and the created_at inequality simultaneously. If you can't refactor the data, then you could at least use a multi-column expression index to get much of the benefit, for example
create index on notifications ((payload -> 'customer' ->> 'email'), created_at)
Then you would need to rewrite the first @> condition in your query as an equality on that same expression, for example n2.payload -> 'customer' ->> 'email' = n.payload -> 'customer' ->> 'email'. You could also include the externalId expression as the 3rd column in the index, in which case the other @> condition would need to be rewritten in an analogous way.
After building an expressional index, you should immediately manually ANALYZE the table, otherwise the planner won't have the expression statistics it might need to make the best choice.
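Putting these steps together, a rough sketch of the expression index, the manual ANALYZE, and the rewritten lateral query might look like the following. The index name is an arbitrary choice, and the externalId comparison with <> is an assumption about the intended semantics: unlike the original NOT (... @> ...) condition, it skips rows where either side has no externalId.

```sql
-- Hypothetical index name; the indexed expressions must match the query text.
CREATE INDEX notifications_email_created_at_idx
    ON notifications ((payload -> 'customer' ->> 'email'), created_at);

-- Collect statistics for the indexed expression right away.
ANALYZE notifications;

SELECT n.payload -> 'customer' ->> 'email' AS email
FROM notifications n
LEFT JOIN LATERAL (
    SELECT n2.payload -> 'customer' ->> 'externalId' AS tid
    FROM notifications n2
    -- Equality on the indexed expression instead of @> containment.
    WHERE (n2.payload -> 'customer' ->> 'email')
          = (n.payload -> 'customer' ->> 'email')
      -- Assumption: rows without an externalId on either side are ignored.
      AND (n2.payload -> 'customer' ->> 'externalId')
          <> (n.payload -> 'customer' ->> 'externalId')
      AND n2.created_at > now() - interval '1 month'
    LIMIT 1
) sub ON true
WHERE n.created_at > now() - interval '1 month'
  AND sub.tid IS NOT NULL;
```

Comparing the output of EXPLAIN (ANALYZE, BUFFERS) before and after is the simplest way to check whether the inner scan actually switches to the new index.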
Here is my suggestion. It uses the function array_unique by @klin from this SO post.
Simple nested queries will do the trick without any need for joins or functions. Such a query can use your index on created_at and should thus be reasonably quick.
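The query that accompanied this answer is not preserved here. As a minimal sketch of the nested-query idea, assuming that "different externalId" means more than one distinct externalId per email among the rows from the last month, something along these lines would work; the aliases recent, email and external_id are illustrative.

```sql
-- Sketch only, not the answerer's original query.
SELECT email
FROM (
    -- Inner query: restrict to the last month so the created_at index can be used.
    SELECT payload -> 'customer' ->> 'email'      AS email,
           payload -> 'customer' ->> 'externalId' AS external_id
    FROM notifications
    WHERE created_at > now() - interval '1 month'
) recent
GROUP BY email
-- Keep only emails seen with at least two different externalId values.
HAVING count(DISTINCT external_id) > 1;
```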
Your main mistake is that when you join notifications laterally, you create 4 JSON objects per row in the table limited by the condition created_at > NOW() - INTERVAL '1 month'. This condition limits the row count to 259398, so your subquery needs to create 259398 * 4 = 1,037,592 JSON objects. And finally, LIMIT is used to get only 1 row when all rows in the JOIN are processed. You should refactor your query.
You can use a CTE to obtain the email, the distinct external_id count and the maximum created_at per email value. Please check a demo.
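The CTE from this answer is not preserved here. A hedged sketch of the approach it describes could look like this; the CTE name per_email and the column aliases are hypothetical. Note that it aggregates over the whole table and only filters on the most recent created_at afterwards, so it counts distinct externalId values across all time rather than within the last month.

```sql
-- Sketch only, not the answerer's original query.
WITH per_email AS (
    SELECT payload -> 'customer' ->> 'email'                      AS email,
           count(DISTINCT payload -> 'customer' ->> 'externalId') AS external_id_count,
           max(created_at)                                        AS last_created_at
    FROM notifications
    GROUP BY payload -> 'customer' ->> 'email'
)
SELECT email
FROM per_email
WHERE external_id_count > 1
  AND last_created_at > now() - interval '1 month';
```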
It can be made even simpler, in a way that lets the index on created_at be used. Please check this demo.
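A flat version of the same idea, again only a sketch under the assumption that both the row count and the externalId comparison should be limited to the last month, moves the date filter into the WHERE clause so the created_at index can narrow the scan before grouping:

```sql
-- Sketch only: filter first, then group and count distinct externalId values.
SELECT payload -> 'customer' ->> 'email' AS email
FROM notifications
WHERE created_at > now() - interval '1 month'
GROUP BY payload -> 'customer' ->> 'email'
HAVING count(DISTINCT payload -> 'customer' ->> 'externalId') > 1;
```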
Meanwhile, I agree with @jjanes that if you extract the email and externalId values of payload into separate columns and create indexes for them, then that can improve query performance.