PySpark Transformer cannot handle DataFrame columns with spaces
I am using PySpark to read data from a Cassandra database.
Packages:
from pyspark.ml.feature import SQLTransformer
from transform.Base import Transform
I have loaded the data, which looks like this:
+----+--------------------+-------+---+
|time| MEM UTI PERC % |devId |Lid|
+----+--------------------+-------+---+
| 482| 8.661052632| 6| 20|
| 654| 9.162190612| 6| 20|
| 364| 8.219230769| 6| 20|
I then apply SQLTransformer with the following SQL statement:
self.sqlstatement = "SELECT Time,MEM UTI PERC % FROM __THIS__ WHERE "
sqltrans = SQLTransformer()
sqltrans.setStatement(self.sqlstatement)
new_df = sqltrans.transform(sparkdf)
It throws this error:
mismatched input 'UTI' expecting {<EOF>, ';'}(line 1, pos 19)
So I modified the SQL statement to wrap the column name containing spaces in double quotes (I also tried single quotes), like this:
SELECT Time,"MEM UTI PERC %" FROM __THIS__ WHERE
This time the transformer doesn't throw an exception, but it replaces every value in that column with the column name itself, like this:
+----+--------------+
|Time|MEM UTI PERC %|
+----+--------------+
| 212|MEM UTI PERC %|
| 26|MEM UTI PERC %|
I want to get the actual data, like this:
+----+--------------+
|Time|MEM UTI PERC %|
+----+--------------+
| 212|20.7 |
| 26|40.0 |
Comments (1)
Try enclosing the column name in single quotes like this and see if it works:
Alternatively, you might need to escape the quotes this way:
See which one works. Cheers!
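For reference, Spark SQL treats text in single or double quotes as a string literal by default, which would explain why every row came back as the text MEM UTI PERC %. Identifiers containing spaces or special characters are quoted with backticks instead. Below is a minimal sketch of that approach. The helpers quote_identifier, build_statement and select_columns are hypothetical names introduced here for illustration, and the sketch assumes the column is named exactly "MEM UTI PERC %" (check df.columns, since leading or trailing spaces would have to be included too).

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_statement(columns, table="__THIS__"):
    """Build a SELECT statement with every column name backtick-quoted."""
    quoted = ", ".join(quote_identifier(c) for c in columns)
    return "SELECT " + quoted + " FROM " + table


# Hypothetical column list, taken from the table shown in the question.
statement = build_statement(["time", "MEM UTI PERC %"])
print(statement)  # SELECT `time`, `MEM UTI PERC %` FROM __THIS__


def select_columns(df, columns):
    """Run the quoted statement through SQLTransformer.

    pyspark is imported here so the helpers above work without Spark installed.
    """
    from pyspark.ml.feature import SQLTransformer

    transformer = SQLTransformer(statement=build_statement(columns))
    return transformer.transform(df)
```

With a loaded DataFrame, select_columns(sparkdf, ["time", "MEM UTI PERC %"]) should return the numeric values rather than the repeated column name. If a WHERE clause is needed, the same backtick quoting applies to any column referenced in it.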