PySpark SQL: How to explode XML with different index attributes
I have a table in Hive that contains a column with an XML string:
<?xml version="1.0"?>
<Employees>
<Employee index="1">
<name>Jane Doe</name>
<age>30</age>
<telephone>111 333 444</telephone>
<education>University</education>
</Employee>
<Employee index="2">
<name>John Doe</name>
<age>32</age>
<telephone>222 333 444</telephone>
<education>High School</education>
</Employee>
</Employees>
I can transfer this data into a new table with this query in Hive without any problems:
SELECT
    index,
    XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="', index, '"]/education')) AS education,
    XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="', index, '"]/age')) AS age
FROM DB.XML_TABLE
LATERAL VIEW OUTER EXPLODE(XPATH(xml, '/Employees/Employee/@index')) lv AS index
But when I try to create a DataFrame in PySpark with spark.sql(...) and this same query, execution ends with this error:
: org.apache.spark.sql.AnalysisException: cannot resolve 'XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education'))' due to data type mismatch: path should be a string literal;
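
For reference, a minimal sketch of the failing call (the session setup is an assumption; the query string is the same Hive query shown above):

from pyspark.sql import SparkSession

# Assumption: Hive support is enabled so spark.sql can resolve DB.XML_TABLE.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The same query that works in Hive; in Spark it raises the AnalysisException
# quoted above, because Spark requires the xpath path argument to be a literal.
df = spark.sql("""
    SELECT
        index,
        XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="', index, '"]/education')) AS education,
        XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="', index, '"]/age')) AS age
    FROM DB.XML_TABLE
    LATERAL VIEW OUTER EXPLODE(XPATH(xml, '/Employees/Employee/@index')) lv AS index
""")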
I tried casting index to a string and using a CTE, but nothing worked.
Is there a way to do this through spark.sql with the Hive query, or with some other workaround?
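
Not part of the original question, but one possible workaround sketch: the error says the path argument must be a string literal, and Spark (unlike Hive) enforces this for xpath_string/xpath_number. With a fixed path, however, plain xpath() returns one array per row, so each field can be pulled out as a parallel array and the arrays exploded together by position. The table name DB.XML_TABLE mirrors the query above; posexplode is assumed available (Spark 2.0+), and the spark session from the sketch above is reused:

# Workaround sketch: avoid dynamic XPath strings by extracting each field into
# a parallel array with a fixed path, then exploding the arrays positionally.
df = spark.sql("""
    SELECT
        idx AS index,
        educations[pos] AS education,
        CAST(ages[pos] AS DOUBLE) AS age
    FROM (
        SELECT
            xpath(xml, '/Employees/Employee/@index') AS indexes,
            xpath(xml, '/Employees/Employee/education/text()') AS educations,
            xpath(xml, '/Employees/Employee/age/text()') AS ages
        FROM DB.XML_TABLE
    ) t
    -- posexplode yields a 0-based position plus the value, so pos indexes
    -- the other arrays and keeps all three fields aligned per Employee
    LATERAL VIEW posexplode(indexes) lv AS pos, idx
""")
df.show()

This relies on xpath() returning matches in document order for each path, so the three arrays stay aligned row by row.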