Unnest vs. just having every needed row in the table

Posted 2025-01-14 15:39:58

I have a choice in how a data table is created and am wondering which approach is more performant.

  1. Making a table with a row for every data point,
  2. Making a table with an array column that will allow repeated content to be unnested

That is, if I have the data:

day   val1  val2
Mon   7     11
Tue   7     11
Wed   8     9
Thu   1     4

Is it better to enter the data as shown, or instead:

day        val1  val2
(Mon,Tue)  7     11
(Wed)      8     9
(Thu)      1     4

And then use unnest() to explode those into unique rows when I need them?

Assume that we're talking about large data in reality - 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table but I'm concerned that unnest would be less performant than just having all of the rows.
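
To make the two options concrete, here is a minimal PostgreSQL sketch (table and column names are my own, chosen for illustration) of both designs, plus the unnest() call that expands option 2 back into one row per day:

-- Option 1: one row per data point
CREATE TABLE daily_flat (
  day  text
, val1 int
, val2 int
);

-- Option 2: days that share the same values collapsed into an array column
CREATE TABLE daily_packed (
  days text[]          -- e.g. '{Mon,Tue}'
, val1 int
, val2 int
);

-- Expanding option 2 back into one row per day when needed:
SELECT d.day, p.val1, p.val2
FROM   daily_packed p, unnest(p.days) AS d(day);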


2 Answers

离鸿 2025-01-21 15:39:58


"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.

Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:

-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS 
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';

-- data table
CREATE TABLE data (
  data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)   
);

-- reference data_id in next table
CREATE TABLE day_data (
  day text 
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id)      -- must be unique?
);

db<>fiddle here

With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
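
Since the function body above elides columns 3 through 19, here is a hedged, runnable concretization with just three text columns (names are placeholders) showing the complete pattern, duplicate check included:

CREATE OR REPLACE FUNCTION public.f_uniq_hash3(col1 text, col2 text, col3 text)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1, $2, $3))))::uuid';

CREATE TABLE data3 (
  data_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1      text
, col2      text
, col3      text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash3(col1, col2, col3)) STORED
, CONSTRAINT data3_uniq_hash_uni UNIQUE (uniq_hash)
);

-- The unique hash now rejects duplicate payloads:
INSERT INTO data3 (col1, col2, col3) VALUES ('a', 'b', 'c');
INSERT INTO data3 (col1, col2, col3) VALUES ('a', 'b', 'c')
ON CONFLICT DO NOTHING;   -- silently skipped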

In-depth explanation in this closely related answer:

You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).

About generated columns:

Basic technique to avoid duplicates while inserting new data:

INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;

If there can be concurrent writes, see:
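
For completeness, a hedged sketch (using only two of the 20 payload columns and made-up values) of how the two tables could be used together, and how to read the data back as one row per day as in option 1. Note that ON CONFLICT DO NOTHING returns no row when the payload already exists, so the linking step looks the row up again:

-- 1) Insert the payload if it is new (duplicates are skipped):
INSERT INTO data (col1, col2) VALUES ('foo', 'bam')
ON CONFLICT DO NOTHING;

-- 2) Link the day to the new or pre-existing data row:
INSERT INTO day_data (day, data_id)
SELECT 'Mon', d.data_id
FROM   data d
WHERE  d.col1 = 'foo' AND d.col2 = 'bam'   -- in practice, match on uniq_hash
ON CONFLICT DO NOTHING;

-- 3) Read back one row per (day, data point):
SELECT dd.day, d.col1, d.col2              -- ... up to col20
FROM   day_data dd
JOIN   data d USING (data_id);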

疯了 2025-01-21 15:39:58


I believe making a table with a row for every data point would be the option I would go for, since unnest on large amounts of data would be just as slow, if not slower. Plus, unless your data is very repetitive, 20 columns is a lot to align.
