选择一个JSONB列的行,在一个JSONB属性中具有不同的值和另一个JSONB属性中的匹配值

发布于 2025-02-09 10:30:42 字数 4575 浏览 2 评论 0 原文

我有一个表通知,其中包含有效载荷 type JSONB 列,并在此列上使用杜松子酒索引。 TALBE当前包含2,742,691行

该表看起来像这样:

ID 有效载荷 created_at
1 {“ customer”:{“ email”:“ [email&nbsp ]
pretanced ; /l/电子邮件保护“ class =” __ cf_email__“ data-cfemail =” CCAAA3A38CA9B4ADA1BCA0A1BCA0A9E2AFA3AA1> [emage  procented]
“电子邮件“:” 2022-06-20
4 {“客户”:{“ email”:“ “,“ externalId”:444} 2022-04-14
5 {“ customer”:{“ email”:“ [email  protected] ”,“ externalID”:555} 2022-04-12
6 { -cgi/l/电子邮件保护“ class =” __ cf_email__“ data-cfemail =” 385F5659785D4059554854545D165B575555555'> [emagy  nbsp; procted
] {“电子邮件”:“ 2022-06-11

我试图查询匹配以下条件的电子邮件地址列表:

  • 相同电子邮件>电子邮件地址存在的多个行中存在
  • 其中一个行确实有不同的 externalID 比以前的一个
  • create_at 在示例表内容的上个月之内

,这只能返回 [email  proceed] 因为

正在使用左JOIN横向这样:

select
  n.payload -> 'customer' -> 'email'
from
  notifications n
  left join lateral (
    select
      n2.payload -> 'customer' ->> 'externalId' tid
    from
      notifications n2
    where
      n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('email', n.payload -> 'customer' -> 'email')
      )
      and not (n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
      ))
      and n2.created_at > NOW() - INTERVAL '1 month' 
    limit
      1
  ) sub on true
where
  n.created_at > NOW() - INTERVAL '1 month'
  and sub.tid is not null;

但是,这需要年龄才能运行。此的查询计划看起来像 https://explain.depesz.com/s/mrib

QUERY PLAN
Nested Loop  (cost=0.17..53386349.38 rows=259398 width=32)
  ->  Index Scan using index_notifications_created_at on notifications n  (cost=0.09..51931.08 rows=259398 width=514)
        Index Cond: (created_at > (now() - '1 mon'::interval))
  ->  Subquery Scan on sub  (cost=0.09..205.60 rows=1 width=0)
        Filter: (sub.tid IS NOT NULL)
        ->  Limit  (cost=0.09..205.60 rows=1 width=32)
              ->  Index Scan using index_notifications_created_at on notifications n2  (cost=0.09..53228.33 rows=259 width=32)
                    Index Cond: (created_at > (now() - '1 mon'::interval))
                    Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true

有任何指示我在这里做错了什么 /如何优化这个问题?

I have a table notifications which contain a payload column of type jsonb with a gin index on this column. The talbe currently contains 2,742,691 rows

The table looks something like this:

id payload created_at
1 {"customer": {"email": "[email protected]", "externalId": 111 } 2022-06-21
2 {"customer": {"email": "[email protected]", "externalId": 222 } 2022-06-20
3 {"customer": {"email": "[email protected]", "externalId": 333 } 2022-06-20
4 {"customer": {"email": "[email protected]", "externalId": 444 } 2022-04-14
5 {"customer": {"email": "[email protected]", "externalId": 555 } 2022-04-12
6 {"customer": {"email": "[email protected]", "externalId": 666 } 2022-06-10
7 {"customer": {"email": "[email protected]", "externalId": 666 } 2022-06-11

I am trying to query a list of email addresses that match the following condition:

  • multiple rows for the same email address exist
  • one of those rows does have a different externalId than one of the previous ones
  • created_at is within the last month

For the example table contents, this should only return [email protected] because

What I was trying is using a LEFT JOIN LATERAL like this:

select
  n.payload -> 'customer' -> 'email'
from
  notifications n
  left join lateral (
    select
      n2.payload -> 'customer' ->> 'externalId' tid
    from
      notifications n2
    where
      n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('email', n.payload -> 'customer' -> 'email')
      )
      and not (n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
      ))
      and n2.created_at > NOW() - INTERVAL '1 month' 
    limit
      1
  ) sub on true
where
  n.created_at > NOW() - INTERVAL '1 month'
  and sub.tid is not null;

however, this is taking ages to run. The Query plan for this looks like https://explain.depesz.com/s/mriB

QUERY PLAN
Nested Loop  (cost=0.17..53386349.38 rows=259398 width=32)
  ->  Index Scan using index_notifications_created_at on notifications n  (cost=0.09..51931.08 rows=259398 width=514)
        Index Cond: (created_at > (now() - '1 mon'::interval))
  ->  Subquery Scan on sub  (cost=0.09..205.60 rows=1 width=0)
        Filter: (sub.tid IS NOT NULL)
        ->  Limit  (cost=0.09..205.60 rows=1 width=32)
              ->  Index Scan using index_notifications_created_at on notifications n2  (cost=0.09..53228.33 rows=259 width=32)
                    Index Cond: (created_at > (now() - '1 mon'::interval))
                    Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Any pointers what I'm doing wrong here / how to optimize this?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

森林迷了鹿 2025-02-16 10:30:43

PostgreSQL的计划者对JSON值的内部内容没有真正的见解,因此它不知道N2中有多少行与N的一行相同的电子邮件。为此,它甚至不知道这是被考虑的问题,因为它不了解 @>与jsonb_build_object的内部工作相互作用。因此,它只是对计划使用了一些非常通用的行估计,并且可能高估了行的数量。

您最好的选择可能是将电子邮件和外部ID从JSONB中拉出,并将其放入真实的列中。这将使既可以更容易制定更好的计划,又可以提供更好的信息,以便计划者可以选择这些更好的计划。

更好的计划是在上使用复合索引(电子邮件,create_at),以便可以直接跳到索引的一部分,同时满足电子邮件相等条件和created_at不平等。

如果您无法重构数据,那么您至少可以使用多列表达索引来获得很多好处,例如通知(((有效载荷 - >'customers' - >>'''电子邮件'),create_at)然后您需要重写第一个 @>查询中的条件看起来像:

n2.payload->'customer'->>'email' = n.payload -> 'customer' ->> 'email'

您还可以将外部体式表达式作为索引中的第三列中包含,在这种情况下是另一个 @>条件将需要以类似的方式重写。

构建表达索引后,您应该立即手动分析表,否则计划者将不具有表达式统计信息,它可能需要做出最佳选择。

PostgreSQL's planner has no real insight into the internals of the JSON values, so it doesn't know how many rows from n2 are expected to have the same email as some row from n does. For that matter, it doesn't even know that that is the question being considered, as it doesn't understand how @> interacts with the inner workings of jsonb_build_object. So it just uses some very generic row estimates for the planning, and probably overestimates the number of rows substantially.

Your best bet is probably to pull the email and externalId out of the JSONB and put them into real columns. This will make it easier both to make better plans available, and make better info available so that the planner can choose those better plans.

The better plan would be to use a composite index on (email, created_at) so it can jump directly to the part of the index that satisfies both the email equality condition and the created_at inequality simultaneously.

If you can't refactor the data, then you could at least use an multi-column expressional index to get much of the benefit, for example on notifications ((payload -> 'customer' ->> 'email'), created_at) Then you would need to rewrite the first @> condition in your query to look like:

n2.payload->'customer'->>'email' = n.payload -> 'customer' ->> 'email'

You could also include the externalId expression as the 3rd column in the index, in which case the other @> condition would need to be rewritten in an analogous way.

After building an expressional index, you should immediately manually ANALYZE the table, otherwise the planner won't have the expression statistics it might need to make the best choice.

风渺 2025-02-16 10:30:43

这是我的建议。它使用函数 array_unique by @klin来自 所以发帖。

select email from
(
  select 
         payload -> 'customer' ->> 'email' email, 
         array_agg(payload -> 'customer' ->> 'externalId') externalid_arr
  from notifications 
  where created_at >= current_date - interval 'P1M'
  -- created_at is within the last month
  group by email
  having count(*) > 1 
  -- multiple rows for the same email address exist
) t 
where array_length(array_unique(externalid_arr), 1) > 1; 
-- one of those rows does have a different externalId than ...

Here is my suggestion. It uses function array_unique by @klin from this SO post.

select email from
(
  select 
         payload -> 'customer' ->> 'email' email, 
         array_agg(payload -> 'customer' ->> 'externalId') externalid_arr
  from notifications 
  where created_at >= current_date - interval 'P1M'
  -- created_at is within the last month
  group by email
  having count(*) > 1 
  -- multiple rows for the same email address exist
) t 
where array_length(array_unique(externalid_arr), 1) > 1; 
-- one of those rows does have a different externalId than ...
め七分饶幸 2025-02-16 10:30:43

简单的嵌套查询将无需加入或功能即可完成工作:

SELECT email
FROM (
    SELECT email, exid
    FROM (
        SELECT payload -> 'customer' -> 'email' AS email,
               payload -> 'customer' -> 'externalId' AS exid
        FROM notifications
        WHERE created_at > CURRENT_DATE - INTERVAL '1 month' ) recent 
    GROUP BY 1, 2 ) emails
GROUP BY 1
HAVING count(*) > 1

这将在 create_at 上使用您的索引,因此应快速合理。

Simple nested queries will do the trick without any need for joining or functions:

SELECT email
FROM (
    SELECT email, exid
    FROM (
        SELECT payload -> 'customer' -> 'email' AS email,
               payload -> 'customer' -> 'externalId' AS exid
        FROM notifications
        WHERE created_at > CURRENT_DATE - INTERVAL '1 month' ) recent 
    GROUP BY 1, 2 ) emails
GROUP BY 1
HAVING count(*) > 1

This will use your index on created_at and should thus be reasonable quick.

素年丶 2025-02-16 10:30:43

您的主要错误是,当您加入通知时,您在表中每行创建4个JSON对象,该对象受条件 create_at&gt;现在() - 间隔'1个月'。这种条件将行计数限制为259398,现在您的子查询需要创建 259398 * 4 = 1 037 592 jsons。最后,<代码>限制用于当处理中的所有行中的所有行时仅获得1行。

您应该重构查询。

您可以使用CTE获取电子邮件,DISTICT external_id 计数和最大 create_at per emage> email like this

WITH cte(email, ext_id_count, max_created_at) AS (
    SELECT
        DISTINCT payload -> 'customer' -> 'email', 
        COUNT(DISTINCT payload -> 'customer' -> 'externalId'), 
        MAX(created_at)
    FROM notifications
    GROUP BY payload -> 'customer' -> 'email'
)
SELECT email FROM cte 
WHERE ext_id_count > 1 AND max_created_at > CURRENT_DATE - INTERVAL '1 month'

,请检查a

”代码> created_at 为了正常工作

SELECT
    DISTINCT payload -> 'customer' -> 'email'
FROM notifications
WHERE EXISTS (
    SELECT 1 FROM notifications n
    WHERE n.created_at > current_date - interval '1 month' 
        AND n.payload->'customer'->>'email' = notifications.payload->'customer'->>'email'
)
AND EXISTS (
    SELECT 1 FROM notifications n
    WHERE n.payload->'customer'->>'email' = notifications.payload->'customer'->>'email'
    AND n.payload->'customer'->>'externalId' != notifications.payload->'customer'->>'externalId'
)

,请检查

”使用@jjanes,如果您提取电子邮件 externalId 有效载荷的值。

Your main mistake is that when you've joined notifications laterally you create 4 JSON objects per row in the table limited by condition created_at > NOW() - INTERVAL '1 month'. This condition limits row count to 259398, and now your subquery need to create 259398 * 4 = 1 037 592 JSONS. And finally LIMIT is used to get only 1 row when all rows in JOIN are processed.

You should refactor your query.

You can use CTE to obtain email, distinct external_id count and maximum created_at per email value like this

WITH cte(email, ext_id_count, max_created_at) AS (
    SELECT
        DISTINCT payload -> 'customer' -> 'email', 
        COUNT(DISTINCT payload -> 'customer' -> 'externalId'), 
        MAX(created_at)
    FROM notifications
    GROUP BY payload -> 'customer' -> 'email'
)
SELECT email FROM cte 
WHERE ext_id_count > 1 AND max_created_at > CURRENT_DATE - INTERVAL '1 month'

Please, check a demo

And it can be made even simplier, like this to make index on created_at to work

SELECT
    DISTINCT payload -> 'customer' -> 'email'
FROM notifications
WHERE EXISTS (
    SELECT 1 FROM notifications n
    WHERE n.created_at > current_date - interval '1 month' 
        AND n.payload->'customer'->>'email' = notifications.payload->'customer'->>'email'
)
AND EXISTS (
    SELECT 1 FROM notifications n
    WHERE n.payload->'customer'->>'email' = notifications.payload->'customer'->>'email'
    AND n.payload->'customer'->>'externalId' != notifications.payload->'customer'->>'externalId'
)

Please, check this demo

Meanwhile I agree with @jjanes in that if you extract email and externalId values of payload into separate columns and create indexes for them, than they can affect query performance.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文