PySpark Transformer cannot handle DataFrame columns with spaces

Posted on 2025-02-10 02:44:03 · 1276 characters · 1 view · 0 comments

I am using PySpark to read data from a Cassandra database.

Packages:

from pyspark.ml.feature import SQLTransformer
from transform.Base import Transform

I have loaded the data, which looks like this:

+----+--------------------+-------+---+
|time|   MEM UTI PERC %   |devId  |Lid|
+----+--------------------+-------+---+
| 482|         8.661052632|      6| 20|
| 654|         9.162190612|      6| 20|
| 364|         8.219230769|      6| 20|

When I apply SQLTransformer with the following SQL statement:

self.sqlstatement = "SELECT Time,MEM UTI PERC % FROM __THIS__ WHERE "

sqltrans = SQLTransformer()
sqltrans.setStatement(self.sqlstatement)
new_df = sqltrans.transform(sparkdf)

it throws this error:

mismatched input 'UTI' expecting {<EOF>, ';'}(line 1, pos 19)

So I modified the SQL statement to wrap the column name containing spaces in double quotes or single quotes, like this:

SELECT Time,"MEM UTI PERC %" FROM __THIS__ WHERE

This time the transformer doesn't throw an exception, but it replaces every value of that column with the column name itself, like this:

+----+--------------+
|Time|MEM UTI PERC %|
+----+--------------+
| 212|MEM UTI PERC %|
|  26|MEM UTI PERC %|

I want to get the data properly, like this:

+----+--------------+
|Time|MEM UTI PERC %|
+----+--------------+
| 212|20.7          |
|  26|40.0          |

Comments (1)

愁杀 2025-02-17 02:44:03

Try enclosing the column name in single quotes like this and see if it works:

self.sqlstatement = "SELECT Time,'MEM UTI PERC %' FROM __THIS__ WHERE "

Alternatively, you might need to escape the quotes this way:

self.sqlstatement = "SELECT Time,\"MEM UTI PERC %\" FROM __THIS__ WHERE "

See which one works. Cheers!
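
One thing to keep in mind: with Spark SQL's default settings, both single-quoted and double-quoted tokens are parsed as string literals, which would explain the output in the question where every row shows the column name. Identifiers that contain spaces are normally delimited with backticks instead. Below is a minimal sketch of that approach. The helpers `quote_identifier`, `build_statement`, and `apply_statement` are hypothetical names made up for illustration, not part of PySpark, and the WHERE clause is left out because its condition was not shown in the question.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_statement(columns, table="__THIS__"):
    """Build a SELECT over the given columns (hypothetical helper)."""
    quoted = ", ".join(quote_identifier(c) for c in columns)
    return "SELECT " + quoted + " FROM " + table


def apply_statement(df, statement):
    """Run the statement through SQLTransformer; requires a working PySpark install."""
    # Imported lazily so the string helpers above can be used without PySpark.
    from pyspark.ml.feature import SQLTransformer
    return SQLTransformer(statement=statement).transform(df)


statement = build_statement(["Time", "MEM UTI PERC %"])
print(statement)  # SELECT `Time`, `MEM UTI PERC %` FROM __THIS__
```

With the DataFrame from the question, this would be used as `new_df = apply_statement(sparkdf, statement)`. If backticks are awkward to maintain, renaming the column to something without spaces before the transform is another option.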
