Extract an array of integers from JSONB
I am designing a table that has a jsonb column realizing permissions with the following format:
[
{"role": 5, "perm": "view"},
{"role": 30, "perm": "edit"},
{"role": 52, "perm": "view"}
]
TL;DR
How do I convert such a jsonb value into an SQL array of integer roles? In this example, it would be '{5,30,52}'::int[]. I have some solutions but none are fast enough. Keep reading...
Each logged-in user has some roles (one or more). The idea is to filter the records using the overlap operator (&&) on int[].
SELECT * FROM data WHERE extract_roles(access) && '{1,5,17}'::int[]
I am looking for the extract_roles function/expression that can also be used in the definition of an index:
CREATE INDEX data_roles ON data USING gin ((extract_roles(access)))
jsonb in Postgres seems to have broad support for building and transforming but less for extracting values - SQL arrays in this case.
What I tried:
create or replace function extract_roles(access jsonb) returns int[]
language sql
strict
parallel safe
immutable
-- with the following bodies:
-- (0) 629ms
select translate(jsonb_path_query_array(access, '$.role')::text, '[]', '{}')::int[]
-- (1) 890ms
select array_agg(r::int) from jsonb_path_query(access, '$.role') r
-- (2) 866ms
select array_agg((t ->> 'role')::int) from jsonb_array_elements(access) as x(t)
-- (3) 706ms
select f1 from jsonb_populate_record(row('{}'::int[]), jsonb_build_object('f1', jsonb_path_query_array(access, '$.role'))) as x (f1 int[])
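For reference, a complete, runnable form of variant (2) might look like this; whether it may honestly be declared immutable is exactly the first question below:
-- complete definition of variant (2); the IMMUTABLE label is the debatable part
create or replace function extract_roles(access jsonb) returns int[]
language sql strict parallel safe immutable
as $$
select array_agg((t ->> 'role')::int)
from jsonb_array_elements(access) as x(t)
$$;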
When the index is used, the query is fast. But there are two problems with these expressions:
- some of the functions are only stable and not immutable; this also applies to the cast. Am I allowed to mark my function as immutable? The immutability is required by the index definition.
- they are slow; the planner does not use the index in some scenarios, and then the query can become really slow (times above are on a table with 3M records):
explain (analyse)
select id, access
from data
where extract_roles(access) && '{-3,99}'::int[]
order by id
limit 100
with the following plan (same for all variants above; prefers scanning the index associated with the primary key, gets sorted results and hopes that it finds 100 of them soon):
Limit (cost=1000.45..2624.21 rows=100 width=247) (actual time=40.668..629.193 rows=100 loops=1)
-> Gather Merge (cost=1000.45..476565.03 rows=29288 width=247) (actual time=40.667..629.162 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using data_pkey on data (cost=0.43..472184.44 rows=12203 width=247) (actual time=25.522..513.463 rows=35 loops=3)
Filter: (extract_roles(access) && '{-3,99}'::integer[])
Rows Removed by Filter: 84918
Planning Time: 0.182 ms
Execution Time: 629.245 ms
Removing the LIMIT clause is paradoxically fast:
Gather Merge (cost=70570.65..73480.29 rows=24938 width=247) (actual time=63.263..75.710 rows=40094 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=69570.63..69601.80 rows=12469 width=247) (actual time=59.870..61.569 rows=13365 loops=3)
Sort Key: id
Sort Method: external merge Disk: 3744kB
Worker 0: Sort Method: external merge Disk: 3232kB
Worker 1: Sort Method: external merge Disk: 3160kB
-> Parallel Bitmap Heap Scan on data (cost=299.93..68722.36 rows=12469 width=247) (actual time=13.823..49.336 rows=13365 loops=3)
Recheck Cond: (extract_roles(access) && '{-3,99}'::integer[])
Heap Blocks: exact=9033
-> Bitmap Index Scan on data_roles (cost=0.00..292.44 rows=29926 width=0) (actual time=9.429..9.430 rows=40094 loops=1)
Index Cond: (extract_roles(access) && '{-3,99}'::integer[])
Planning Time: 0.234 ms
Execution Time: 77.719 ms
Is there any better and faster way to extract int[] from a jsonb? Because I cannot rely on the planner always using the index. Playing with COST of the extract_roles function helps a bit (planner starts using the index for LIMIT 1000) but even an insanely high value does not force the index for LIMIT 100.
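The COST tweak mentioned above would look roughly like this (the value 1000 is only an example, not a recommendation):
-- raise the planner's cost estimate for the extraction function
alter function extract_roles(jsonb) cost 1000;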
Comments:
If there is not, I will probably store the information in another column roles int[], which is fast but takes extra space and requires extra treatment (can be solved using generated columns on Postgres 12+, which Azure still does not provide, or a trigger, or in the application logic).
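A minimal sketch of the generated-column variant, assuming Postgres 12+ and an extract_roles that can be declared immutable:
-- materialize the roles once per row and index the plain int[] column
alter table data add column roles int[] generated always as (extract_roles(access)) stored;
create index data_roles_arr on data using gin (roles);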
Looking into the future, will there be any better support in Postgres 15? Maybe JSON_QUERY but I don't see any immediate improvement because its RETURNING clause probably refers to the whole result and not its elements.
Maybe jsonb_populate_record could also consider non-composite types (its signature allows it) such as:
select jsonb_populate_record(null::int[], '[123,456]'::jsonb)
The two closest questions are:
- Extract integer array from jsonb within postgres 9.6
- Cast postgresql jsonb value as array of int and remove element from it
Reaction to suggested normalization:
Normalization is probably not viable. But let's follow the train of thought.
I assume that the extra table would look like this: *_perm (id, role, perm). There would be an index on id and another index on role.
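A sketch of that normalized layout (the names are assumptions following the *_perm pattern):
create table posts_perm (
id bigint not null, -- references posts_data(id)
role int not null,
perm text not null
);
create index posts_perm_id on posts_perm (id);
create index posts_perm_role on posts_perm (role);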
Because a user has multiple roles, it could join multiple records for the same id, which would cause multiplication of the records in the data table and force a group by aggregation.
A group by is bad for performance because it prevents some optimizations. I am designing a building block. So there can be for example two data tables at play:
select pd.*, jsonb_agg(to_jsonb(pp))
from posts_data pd
join posts_perm pp on pd.id = pp.id
where exists(
select 1
from comments_data cd
join comments_perm cp on cp.id = cd.id
where cd.post_id = pd.id
and cd.reputation > 100
and cp.role in (3,34,52)
-- no group by needed due to semi-join
)
and pp.role in (3,34,52)
group by pd.id
order by pd.title
limit 10
If I am not mistaken, this query will require the aggregation of all records before they are sorted. No index can help here. That will never be fast with millions of records. Moreover, there is non-trivial logic behind group by usage - it is not always needed.
What if we did not need to return the permissions but only cared about their existence?
select pd.*
from posts_data pd
where exists(
select 1
from posts_perm pp
where pp.id = pd.id
and pp.role in (3,34,52)
)
and exists(
select 1
from comments_data cd
where cd.post_id = pd.id
and exists(
select 1
from comments_perm cp
where cp.id = cd.id
and cp.role in (3,34,52)
)
and cd.reputation > 100
)
order by pd.title
limit 10
Then we don't need any aggregation - the database will simply issue a SEMI-JOIN. If there is an index on title, the database may consider using it. We can even fetch the permissions in the projection; something like this:
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp where pp.id = pd.id) perm
...
Here a nested-loop join will be issued for only the few (10) records. I will test this approach.
Another option is to keep the data in both tables - the data table would only store an int[] of roles. Then we save a JOIN and only fetch from the permission table at the end. Now we need an index that supports array operations - GIN.
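The GIN index for that array column could be defined like this (a sketch; posts_data.roles is the assumed int[] column):
create index posts_data_roles on posts_data using gin (roles);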
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp where pp.id = pd.id) perm
from posts_data pd
where pd.roles && '{3,34,52}'::int[]
and exists(
select 1
from comments_data cd
where cd.post_id = pd.id
and cd.roles && '{3,34,52}'::int[]
and cd.reputation > 100
)
order by pd.title
limit 10
Because we always aggregate all permissions for the returned records (their interpretation happens in the application, and it does not matter that we return all of them), we can store the post_perms as json. Because we never need to work with the values in SQL, storing them directly in the data table seems reasonable.
We will need to support some bulk-sharing operations later that update the permissions for many records, but that is much rarer than selects. Because of this we could favor jsonb instead.
The projection no longer needs the subselect for the permissions:
select pd.*
...
But now the roles column is redundant - we have the same information in the same table, just in JSON format. If we can write a function that extracts just the roles, we can directly index it.
And we are back at the beginning. But it looks like the extract_roles function is never going to be fast, so we need to keep the roles column.
Another reason for keeping permissions in the same table is the possibility of combining multiple indices using Bitmap And and avoiding a join.
There will be a huge bias in the roles. Some are going to be present on almost all rows (admin can edit everything), others will be rare (John Doe can only access these 3 records that were explicitly shared with him). I am not sure how well statistics will work on the int[] approach but so far my tests show that the GIN index is used when the role is infrequent (high selectivity).
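One way to see what the planner knows about that skew is the element-level statistics Postgres keeps for array columns (a sketch, assuming the int[] variant stored in a column named roles):
select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'data' and attname = 'roles';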
Answers (2)
It looks like the core problem here is the classic one with WHERE ... ORDER BY ... LIMIT, that the planner assumes all of the qualifying rows are scattered evenly throughout the ordering. But that isn't the case here: rows meeting your && condition are selectively deficient in low-numbered "id". So it has to walk that index far farther than it thought it would need to before it catches the LIMIT.
There is nothing you can do (in any version) to get the planner to estimate that better. You could just prevent that index from being used by rewriting it to order by id+0. But then it wouldn't use that plan even when it would truly be faster, like the admin who is on everything. (Which by the way seems like a bad idea--an exceptional user should probably be handled exceptionally, not shoehorned into the normal system.)
The immutable extraction function certainly is slow, but if the above planning problem were fixed that wouldn't matter. Making the function faster would probably require some compiled code, and Azure surely would not let you link the .so file into their managed server.
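For reference, the order by id+0 rewrite mentioned above, applied to the query from the question (a sketch):
select id, access
from data
where extract_roles(access) && '{-3,99}'::int[]
order by id + 0 -- id + 0 prevents the planner from using the data_pkey index for the ordering
limit 100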
Because the JSON has a regular structure (int, text), I also considered two alternative storages:
- a composite type role of (int, text), storing the array role[]; the extract_roles function is still needed;
- two separate arrays, int[] and text[].
The latter one won for the following reasons:
- no need for the extract_roles function - the int array is stored directly;
Disadvantages are:
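For illustration only, a minimal sketch of the two-array storage this answer settled on (column names are assumptions):
-- hypothetical columns: one array per JSON key, kept in the data table itself
alter table data add column roles int[], add column perms text[];
create index data_roles_gin on data using gin (roles);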