Pyspark SQL-如何使用不同索引属性爆炸XML

发布于 2025-02-05 14:26:48 字数 1344 浏览 2 评论 0原文

我在Hive中的表中包含XML字符串的列：

<?xml version="1.0"?>
<Employees>
    <Employee index="1">
        <name>Jane Doe</name>
        <age>30</age>
        <telephone>111 333 444</telephone>
        <education>University</education>
    </Employee>
    <Employee index="2">
        <name>John Doe</name>
        <age>32</age>
        <telephone>222 333 444</telephone>
        <education>High School</education>
    </Employee>
</Employees>

我可以在Hive中将此数据传输到新表中，而没有任何问题：

SELECT   
index,
XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education')) AS education,
XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/age')) AS age
FROM DB.XML_TABLE   
LATERAL VIEW OUTER EXPLODE(XPATH(xml ,  '/Employees/Employee/@index')) lv AS  index

但是，当我想在Pyspark中使用Spark.sql（...）创建dataFrame，并且同样的查询，执行以此错误结束：

：org.apache.spark.sql.sql.analysisexception：无法解析'xpath_string（xml，concat（'/employs/employee [@index =“''，index，'']/ducunsion'”））不匹配：路径应该是字符串文字;

我尝试将索引作为字符串施放，尝试使用CTE，但无效。有没有办法通过Spark.sql进行蜂巢查询或与其他解决方法进行操作？

原文

I have table in Hive which contains column with xml string:

<?xml version="1.0"?>
<Employees>
    <Employee index="1">
        <name>Jane Doe</name>
        <age>30</age>
        <telephone>111 333 444</telephone>
        <education>University</education>
    </Employee>
    <Employee index="2">
        <name>John Doe</name>
        <age>32</age>
        <telephone>222 333 444</telephone>
        <education>High School</education>
    </Employee>
</Employees>

I can transfer this data into new table with this query in Hive without any problems:

SELECT   
index,
XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education')) AS education,
XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/age')) AS age
FROM DB.XML_TABLE   
LATERAL VIEW OUTER EXPLODE(XPATH(xml ,  '/Employees/Employee/@index')) lv AS  index

But when I want to create a Dataframe in PySpark with spark.sql(...) and this same query, execution ends with this error:

: org.apache.spark.sql.AnalysisException: cannot resolve 'XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education'))' due to data type mismatch: path should be a string literal;

I try to cast index as string, try to use CTE, but nothing works.
Is there a way to do it through spark.sql with Hive query or with other workaround?

分享到QQ

分享到微博