红移枢轴函数
我有一个类似的表,我正在尝试在 Redshift 中进行旋转:
UUID | Key | Value |
---|---|---|
a123 | Key1 | Val1 |
b123 | Key2 | Val2 |
c123 | Key3 | Val3 |
目前我正在使用以下代码来旋转它,它工作正常。但是,当我用子查询替换 IN 部分时,它会引发错误。
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
'Key1',
'Key2',
'Key3
))
问题:将 IN 部分替换为从 Key
列中获取不同值的子查询的最佳方法是什么?
我想要实现的目标;
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
select distinct "keys" from tbl
))
I've got a similar table which I'm trying to pivot in Redshift:
UUID | Key | Value |
---|---|---|
a123 | Key1 | Val1 |
b123 | Key2 | Val2 |
c123 | Key3 | Val3 |
Currently I'm using following code to pivot it and it works fine. However, when I replace the IN part with subquery it throws an error.
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
'Key1',
'Key2',
'Key3
))
Question: What's the best way to replace the IN part with sub query which takes distinct values from Key
column?
What I am trying to achieve;
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
select distinct "keys" from tbl
))
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
来自 Redshift 文档 - “PIVOT IN 列表值不能是列引用或子查询。每个值必须与 FOR 列引用类型兼容。”请参阅:https://docs.aws。 amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html
所以我认为这需要作为2 个查询的序列。如果您需要将其作为单个命令,您可能可以在存储过程中执行此操作。
使用请求的存储过程进行更新,并将结果显示为游标示例:
为了使您能够支持这一点,我将添加一些背景信息和其工作原理的描述。首先,存储过程不能产生仅限于您的工作台的结果。它可以将结果存储在(临时)表中或命名游标中。游标只是将查询结果存储在领导节点上等待获取。游标的生命周期是当前事务,因此提交或回滚将删除游标。
以下是您希望作为单独的 SQL 语句发生的情况,但首先让我们设置测试数据:
您要执行的操作首先是为 PIVOT 子句 IN 列表创建一个字符串,如下所示:
然后您想要获取该字符串并将其插入到你的 PIVOT 查询中,它应该看起来像这样:
但是在工作台上执行此操作意味着将一个查询的结果复制/粘贴到第二个查询中,并且你希望这自动发生。不幸的是,出于上述原因,Redshift 确实允许 PIVOT 语句中的子查询。
我们可以获取一个查询的结果,并使用它在存储过程中构造并运行另一个查询。这是这样一个存储过程:
该过程的作用是定义并使用生成 IN 列表的查询结果填充名为“row”的“记录”(1 行数据)。接下来,它打开一个游标,其名称由调用命令提供,其中包含使用记录“行”中的 IN 列表的 PIVOT 查询的内容。完毕。
执行时(通过运行调用)此函数将在包含 PIVOT 查询结果的领导节点上生成一个游标。在此存储过程中,要创建的游标名称作为字符串传递给函数。
此时需要做的就是从指定游标中“获取”数据。这是通过 FETCH 命令完成的。
我在单节点 Redshift 集群上对此进行了原型设计,但此配置不支持“FETCH ALL”,因此我不得不使用“FETCH 1000”。因此,如果您也在单节点集群上,则需要使用:
最后要注意的一点是游标“mycursor”现在存在,如果您尝试重新运行存储过程,它将失败。您可以向过程传递不同的名称(创建另一个游标),或者可以结束事务(END、COMMIT 或 ROLLBACK),或者可以使用 CLOSE 关闭游标。一旦光标被销毁,您可以对新光标使用相同的名称。如果您希望这是可重复的,您可以运行这批命令:
请记住,游标具有当前事务的生命周期,因此结束事务的任何操作都会破坏游标。如果您的工作台中启用了 AUTOCOMMIT,这将插入破坏游标的 COMMIT(您可以批量运行 CALL 和 FETCH 以防止在许多工作台中发生这种情况)。此外,某些命令执行隐式 COMMIT,并且还会销毁游标(如 TRUNCATE)。
由于这些原因,并且根据您需要围绕 PIVOT 查询执行的其他操作,您可能希望将存储过程写入临时表而不是游标。然后可以查询临时表以获取结果。临时表具有会话的生命周期,因此粘性稍强,但效率稍低,因为需要创建表,需要将 PIVOT 查询的结果写入计算节点,然后必须将结果写入计算节点。发送到领导节点以产生所需的输出。只需要为工作选择合适的工具即可。
=====================================
要填充存储过程中的表,您只需执行命令即可。整个事情看起来像:
如果您希望这个新表具有用于优化下游查询的键,您将需要在一条语句中创建该表,然后插入其中,但这是快速路径。
From the Redshift documentation - "The PIVOT IN list values cannot be column references or sub-queries. Each value must be type compatible with the FOR column reference." See: https://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html
So I think this will need to be done as a sequence of 2 queries. You likely can do this in a stored procedure if you need it as a single command.
Updated with requested stored procedure with results to a cursor example:
In order to make this supportable by you I'll add some background info and description of how this works. First off a stored procedure cannot produce results strait to your bench. It can either store the results in a (temp) table or to a named cursor. A cursor is just storing the results of a query on the leader node where they wait to be fetched. The lifespan of the cursor is the current transaction so a commit or rollback will delete the cursor.
Here's what you want to happen as individual SQL statements but first lets set up the test data:
The actions you want to perform are first to create a string for the PIVOT clause IN list like so:
Then you want to take this string and insert it into your PIVOT query which should look like this:
But doing this in the bench will mean taking the result of one query and copy/paste-ing into a second query and you want this to happen automatically. Unfortunately Redshift does allow sub-queries in PIVOT statement for the reason given above.
We can take the result of one query and use it to construct and run another query in a stored procedure. Here's such a store procedure:
What this procedure does is define and populate a "record" (1 row of data) called "row" with the result of the query that produces the IN list. Next it opens a cursor, whose name is provided by the calling command, with the contents of the PIVOT query which uses the IN list from the record "row". Done.
When executed (by running call) this function will produce a cursor on the leader node that contains the result of the PIVOT query. In this stored procedure the name of the cursor to create is passed to the function as a string.
All that needs to be done at this point is to "fetch" the data from the named cursor. This is done with the FETCH command.
I prototyped this on a single node Redshift cluster and "FETCH ALL" is not supported at this configuration so I had to use "FETCH 1000". So if you are also on a single node cluster you will need to use:
The last point to note is that the cursor "mycursor" now exists and if you tried to rerun the stored procedure it will fail. You could pass a different name to the procedure (making another cursor) or you could end the transaction (END, COMMIT, or ROLLBACK) or you could close the cursor using CLOSE. Once the cursor is destroyed you can use the same name for a new cursor. If you wanted this to be repeatable you could run this batch of commands:
Remember that the cursor has a lifespan of the current transaction so any action that ends the transaction will destroy the cursor. If you have AUTOCOMMIT enable in your bench this will insert COMMITs destroying the cursor (you can run the CALL and FETCH in a batch to prevent this in many benches). Also some commands perform an implicit COMMIT and will also destroy the cursor (like TRUNCATE).
For these reasons, and depending on what else you need to do around the PIVOT query, you may want to have the stored procedure write to a temp table instead of a cursor. Then the temp table can be queried for the results. A temp table has a lifespan of the session so is a little stickier but is a little less efficient as a table needs to be created, the result of the PIVOT query needs to be written to the compute nodes, and then the results have to be sent to the leader node to produce the desired output. Just need to pick the right tool for the job.
===================================
To populate a table within a stored procedure you can just execute the commands. The whole thing will look like:
If you want this new table to have keys for optimizing downstream queries you will want to create the table in one statement then insert into it but this is quickie path.
有点偏离主题,但我想知道为什么亚马逊不能为枢纽引入更简单的语法。 IMO,如果将 GROUP BY 替换为 PIVOT BY,它可以向解释器提供足够的提示以将行转换为列。例如:
可以写成:
甚至多级透视也可以用相同的语法处理。
A little off-topic, but I wonder why Amazon couldn't introduce a simpler syntax for pivot. IMO, if GROUP BY is replaced by PIVOT BY, it can give enough hint to the interpreter to transform rows into columns. For example:
can be written as:
Even multi-level pivoting can also be handled in the same syntax.