unnest() vs. just having every row you need in the table
I have a choice in how a data table is created and am wondering which approach is more performant.
- Making a table with a row for every data point,
- Making a table with an array column that will allow repeated content to be unnested
That is, if I have the data:
day | val1 | val2 |
---|---|---|
Mon | 7 | 11 |
Tue | 7 | 11 |
Wed | 8 | 9 |
Thu | 1 | 4 |
Is it better to enter the data in as shown, or instead:
day | val1 | val2 |
---|---|---|
(Mon,Tue) | 7 | 11 |
(Wed) | 8 | 9 |
(Thu) | 1 | 4 |
And then use unnest() to explode those into unique rows when I need them?
Assume that we're talking about large data in reality - 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table but I'm concerned that unnest would be less performant than just having all of the rows.
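In PostgreSQL terms, the two options might look roughly like this (a minimal sketch with placeholder table and column names, not the real 20-column schema):

```sql
-- Option 1: one row per data point
CREATE TABLE readings_flat (
    day  text NOT NULL,
    val1 int  NOT NULL,
    val2 int  NOT NULL
);

-- Option 2: collapse repeated value combinations into an array of days
CREATE TABLE readings_packed (
    days text[] NOT NULL,  -- e.g. '{Mon,Tue}'
    val1 int    NOT NULL,
    val2 int    NOT NULL
);

-- Expanding the packed form back to one row per day when needed
SELECT u.day, r.val1, r.val2
FROM   readings_packed r
CROSS  JOIN LATERAL unnest(r.days) AS u(day);
```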
2 Answers
"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.
Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate
bigint
PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make itUNIQUE
. I suggest a custom function for the purpose:db<>fiddle here
With only
text
columns, the function is actuallyIMMUTABLE
(which we need!). For other data types (liketimestamptz
) it would not be.In-depth explanation in this closely related answer:
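As a rough illustration of that layout (this is not the code from the linked fiddle; md5() over the concatenated text columns stands in for the custom hash function, and two placeholder columns stand in for the real 20):

```sql
CREATE TABLE readings (
    reading_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate PK
    col1 text NOT NULL,
    col2 text NOT NULL,  -- ... plus the rest of the 20 data columns
    -- 16-byte hash over all data columns; md5(text), || and the text->uuid cast
    -- are all IMMUTABLE, which a generated column requires
    uniq_hash uuid GENERATED ALWAYS AS (md5(col1 || '|' || col2)::uuid) STORED,
    UNIQUE (uniq_hash)
);
```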
You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes). About generated columns:

Basic technique to avoid duplicates while inserting new data:

If there can be concurrent writes, see:
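A minimal sketch of the duplicate-skipping insert against the hypothetical table above, using ON CONFLICT on the unique hash (the linked answers cover the finer points, in particular under concurrent writes):

```sql
INSERT INTO readings (col1, col2)
VALUES ('Mon', '7'),
       ('Mon', '7')          -- same content: only the first row is inserted
ON CONFLICT (uniq_hash) DO NOTHING;
```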
I believe making a table with a row for every data point would be the option I would go for, since unnest() over large amounts of data would be just as slow, if not slower. Plus, unless your data is highly repetitive, 20 columns is a lot to keep aligned.