Unnest vs. just having every needed row in the table

Posted 2025-01-14 15:39:58

I have a choice in how a data table is created and am wondering which approach is more performant.

  1. Making a table with a row for every data point,
  2. Making a table with an array column that will allow repeated content to be unnested

That is, if I have the data:

day   val1  val2
Mon   7     11
Tue   7     11
Wed   8     9
Thu   1     4

Is it better to enter the data as shown, or instead:

day        val1  val2
(Mon,Tue)  7     11
(Wed)      8     9
(Thu)      1     4

And then use unnest() to explode those into unique rows when I need them?

Assume that we're talking about large data in reality - 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table but I'm concerned that unnest would be less performant than just having all of the rows.
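
To make the two options concrete, here is a minimal PostgreSQL sketch (table and column names are my own, chosen for illustration) of both designs, plus the unnest() call that expands option 2 back into one row per day:

-- Option 1: one row per data point
CREATE TABLE daily_flat (
  day  text
, val1 int
, val2 int
);

-- Option 2: days that share the same values collapsed into an array column
CREATE TABLE daily_packed (
  days text[]          -- e.g. '{Mon,Tue}'
, val1 int
, val2 int
);

-- Expanding option 2 back into one row per day when needed:
SELECT d.day, p.val1, p.val2
FROM   daily_packed p, unnest(p.days) AS d(day);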


2 Answers

离鸿 2025-01-21 15:39:58


"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.

Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:

-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS 
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';

-- data table
CREATE TABLE data (
  data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)   
);

-- reference data_id in next table
CREATE TABLE day_data (
  day text 
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id)      -- must be unique?
);

db<>fiddle here

With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
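
Since the function body above elides columns 3 through 19, here is a hedged, runnable concretization with just three text columns (names are placeholders) showing the complete pattern, duplicate check included:

CREATE OR REPLACE FUNCTION public.f_uniq_hash3(col1 text, col2 text, col3 text)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1, $2, $3))))::uuid';

CREATE TABLE data3 (
  data_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1      text
, col2      text
, col3      text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash3(col1, col2, col3)) STORED
, CONSTRAINT data3_uniq_hash_uni UNIQUE (uniq_hash)
);

-- The unique hash now rejects duplicate payloads:
INSERT INTO data3 (col1, col2, col3) VALUES ('a', 'b', 'c');
INSERT INTO data3 (col1, col2, col3) VALUES ('a', 'b', 'c')
ON CONFLICT DO NOTHING;   -- silently skipped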

In-depth explanation in this closely related answer:

You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).

About generated columns:

Basic technique to avoid duplicates while inserting new data:

INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;

If there can be concurrent writes, see:
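
For completeness, a hedged sketch (using only two of the 20 payload columns and made-up values) of how the two tables could be used together, and how to read the data back as one row per day as in option 1. Note that ON CONFLICT DO NOTHING returns no row when the payload already exists, so the linking step looks the row up again:

-- 1) Insert the payload if it is new (duplicates are skipped):
INSERT INTO data (col1, col2) VALUES ('foo', 'bam')
ON CONFLICT DO NOTHING;

-- 2) Link the day to the new or pre-existing data row:
INSERT INTO day_data (day, data_id)
SELECT 'Mon', d.data_id
FROM   data d
WHERE  d.col1 = 'foo' AND d.col2 = 'bam'   -- in practice, match on uniq_hash
ON CONFLICT DO NOTHING;

-- 3) Read back one row per (day, data point):
SELECT dd.day, d.col1, d.col2              -- ... up to col20
FROM   day_data dd
JOIN   data d USING (data_id);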

疯了 2025-01-21 15:39:58


I believe making a table with a row for every data point would be the option I would go for, since unnest on large amounts of data would be just as slow, if not slower. Plus, unless your data is very repetitive, 20 columns is a lot to align.
