Pyspark SQL-如何使用不同索引属性爆炸XML

发布于 2025-02-05 14:26:48 字数 1344 浏览 2 评论 0原文

我在Hive中的表中包含XML字符串的列:

<?xml version="1.0"?>
<Employees>
    <Employee index="1">
        <name>Jane Doe</name>
        <age>30</age>
        <telephone>111 333 444</telephone>
        <education>University</education>
    </Employee>
    <Employee index="2">
        <name>John Doe</name>
        <age>32</age>
        <telephone>222 333 444</telephone>
        <education>High School</education>
    </Employee>
</Employees>

我可以在Hive中将此数据传输到新表中,而没有任何问题:

SELECT   
index,
XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education')) AS education,
XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/age')) AS age
FROM DB.XML_TABLE   
LATERAL VIEW OUTER EXPLODE(XPATH(xml ,  '/Employees/Employee/@index')) lv AS  index

但是,当我想在Pyspark中使用Spark.sql(...)创建dataFrame,并且同样的查询,执行以此错误结束:

:org.apache.spark.sql.sql.analysisexception:无法解析'xpath_string(xml,concat('/employs/employee [@index =“'',index,'']/ducunsion'”))不匹配:路径应该是字符串文字;

我尝试将索引作为字符串施放,尝试使用CTE,但无效。 有没有办法通过Spark.sql进行蜂巢查询或与其他解决方法进行操作?

I have table in Hive which contains column with xml string:

<?xml version="1.0"?>
<Employees>
    <Employee index="1">
        <name>Jane Doe</name>
        <age>30</age>
        <telephone>111 333 444</telephone>
        <education>University</education>
    </Employee>
    <Employee index="2">
        <name>John Doe</name>
        <age>32</age>
        <telephone>222 333 444</telephone>
        <education>High School</education>
    </Employee>
</Employees>

I can transfer this data into new table with this query in Hive without any problems:

SELECT   
index,
XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education')) AS education,
XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/age')) AS age
FROM DB.XML_TABLE   
LATERAL VIEW OUTER EXPLODE(XPATH(xml ,  '/Employees/Employee/@index')) lv AS  index

But when I want to create a Dataframe in PySpark with spark.sql(...) and this same query, execution ends with this error:

: org.apache.spark.sql.AnalysisException: cannot resolve 'XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education'))' due to data type mismatch: path should be a string literal;

I try to cast index as string, try to use CTE, but nothing works.
Is there a way to do it through spark.sql with Hive query or with other workaround?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文