将XML标签解析到pandas dataframe

发布于 2025-02-05 15:47:01 字数 4183 浏览 4 评论 0原文

我有以下XML文件。

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pbsCrew>
    <schedulePeriod>
        <startDate>2022-06-02</startDate>
        <endDate>2022-07-01</endDate>
    </schedulePeriod>
    <crewMembers>
        <crewMember>
            <id>12345</id>
            <firstName>John</firstName>
            <middleName>S</middleName>
            <lastName>Doe</lastName>
            <seniorityNumber>2</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities/>
                <monthlyMax>9983</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities>
                    <city>
                        <code>GUC</code>
                    </city>
                    <city>
                        <code>MSO</code>
                    </city>
                    </city>
                </restricted75HrCities>
                <equipmentTypes>
                    <equipment>
                        <type>777</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
        <crewMember>
            <id>22222</id>
            <firstName>JANE</firstName>
            <middleName>R</middleName>
            <lastName>DOE</lastName>
            <seniorityNumber>8</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities>
                    <hitcity>
                        <code>OAX</code>
                    </hitcity>
                    <hitcity>
                        <code>MSO</code>
                    </hitcity>
                    <hitcity>
                        <code>US</code>
                    </hitcity>
                </hitcities>
                <monthlyMax>7642</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities/>
                <equipmentTypes>
                    <equipment>
                        <type>787</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <trainerLineCheck>false</trainerLineCheck>
                <trainerLineIndoctrination>false</trainerLineIndoctrination>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
   </crewMembers>
</pbsCrew>

我想解析 crewmember 标签,并从那里获得 iD,firstName,lastname ,以及 flight> Flight Qualification 标签 Strong>全部变成一个pandas dataframe,如下所示:

ID名称lastnameisVaccinated
12345JohnDoeTrue
22222JaneDoeTrue

我知道这将涉及一个用于检索一个名称和姓氏值的循环,但是很难提取我需要的所有值。有什么快速的方法吗?

I have XML file as below.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pbsCrew>
    <schedulePeriod>
        <startDate>2022-06-02</startDate>
        <endDate>2022-07-01</endDate>
    </schedulePeriod>
    <crewMembers>
        <crewMember>
            <id>12345</id>
            <firstName>John</firstName>
            <middleName>S</middleName>
            <lastName>Doe</lastName>
            <seniorityNumber>2</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities/>
                <monthlyMax>9983</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities>
                    <city>
                        <code>GUC</code>
                    </city>
                    <city>
                        <code>MSO</code>
                    </city>
                    </city>
                </restricted75HrCities>
                <equipmentTypes>
                    <equipment>
                        <type>777</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
        <crewMember>
            <id>22222</id>
            <firstName>JANE</firstName>
            <middleName>R</middleName>
            <lastName>DOE</lastName>
            <seniorityNumber>8</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities>
                    <hitcity>
                        <code>OAX</code>
                    </hitcity>
                    <hitcity>
                        <code>MSO</code>
                    </hitcity>
                    <hitcity>
                        <code>US</code>
                    </hitcity>
                </hitcities>
                <monthlyMax>7642</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities/>
                <equipmentTypes>
                    <equipment>
                        <type>787</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <trainerLineCheck>false</trainerLineCheck>
                <trainerLineIndoctrination>false</trainerLineIndoctrination>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
   </crewMembers>
</pbsCrew>

I would like to parse the crewMember tag and get the id, firstName, lastName from there and the isVaccinated tag from flightQualification all into one Pandas dataframe like below:

IDfirstnamelastnameisVaccinated
12345JohnDoeTrue
22222JaneDoeTrue

I know this will involve a for loop to retrieve the firstName and lastName values, but having trouble extracting all the values I need. Is there a quick way to do this?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

゛时过境迁 2025-02-12 15:47:01

尽管可以使用LOOP使用ElementTree或LXML库来完成,但是使用PANDAS READ_XML()方法可以使用StylesHeet属性实现相同的目标。此属性指定用于转换原始XML的XSL样式表。在您的情况下,需要将其弄平,使所有感兴趣的要素都包含在一个母元素中。

这是一个示例:

xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pbsCrew>
    <schedulePeriod>
        <startDate>2022-06-02</startDate>
        <endDate>2022-07-01</endDate>
    </schedulePeriod>
    <crewMembers>
        <crewMember>
            <id>12345</id>
            <firstName>John</firstName>
            <middleName>S</middleName>
            <lastName>Doe</lastName>
            <seniorityNumber>2</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities/>
                <monthlyMax>9983</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities>
                    <city>
                        <code>GUC</code>
                    </city>
                    <city>
                        <code>MSO</code>
                    </city>
                </restricted75HrCities>
                <equipmentTypes>
                    <equipment>
                        <type>777</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
        <crewMember>
            <id>22222</id>
            <firstName>JANE</firstName>
            <middleName>R</middleName>
            <lastName>DOE</lastName>
            <seniorityNumber>8</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities>
                    <hitcity>
                        <code>OAX</code>
                    </hitcity>
                    <hitcity>
                        <code>MSO</code>
                    </hitcity>
                    <hitcity>
                        <code>US</code>
                    </hitcity>
                </hitcities>
                <monthlyMax>7642</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities/>
                <equipmentTypes>
                    <equipment>
                        <type>787</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <trainerLineCheck>false</trainerLineCheck>
                <trainerLineIndoctrination>false</trainerLineIndoctrination>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
   </crewMembers>
</pbsCrew>
'''

import pandas as pd

stylesheet = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:template match="/">
        <members>
            <xsl:apply-templates select="//crewMember"/>
        </members>
    </xsl:template>
    <xsl:template match="//crewMember">
        <crewMember>
            <ID>
                <xsl:value-of select="./id"/>
            </ID>
            <firstname>
                <xsl:value-of select="./firstName"/>
            </firstname>
            <lastname>
                <xsl:value-of select="./lastName"/>
            </lastname>
            <isVaccinated>
                <xsl:value-of select="./flightQualification/isVaccinated"/>
            </isVaccinated>
        </crewMember>
    </xsl:template>
</xsl:stylesheet>'''

df = pd.read_xml(xml, xpath="//crewMember", stylesheet = stylesheet)
print(df)

此处样式表将原始XML转换为以下形式:

<?xml version="1.0" encoding="UTF-8"?>
<members>
   <crewMember>
      <ID>12345</ID>
      <firstname>John</firstname>
      <lastname>Doe</lastname>
      <isVaccinated>true</isVaccinated>
   </crewMember>
   <crewMember>
      <ID>22222</ID>
      <firstname>JANE</firstname>
      <lastname>DOE</lastname>
      <isVaccinated>true</isVaccinated>
   </crewMember>
</members>

这允许我们使用XPATH // CrewMember“crewmember

Although it can be done with elementtree or lxml library using the loop, same goal can be achieved using pandas read_xml() method using the stylesheet attribute. This attribute specifies an XSL stylesheet used to transform original xml. In your case it's needed to flatten it, making all elements of interest to be contained within one parent element serving as a row.

Here is the example:

xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pbsCrew>
    <schedulePeriod>
        <startDate>2022-06-02</startDate>
        <endDate>2022-07-01</endDate>
    </schedulePeriod>
    <crewMembers>
        <crewMember>
            <id>12345</id>
            <firstName>John</firstName>
            <middleName>S</middleName>
            <lastName>Doe</lastName>
            <seniorityNumber>2</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities/>
                <monthlyMax>9983</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities>
                    <city>
                        <code>GUC</code>
                    </city>
                    <city>
                        <code>MSO</code>
                    </city>
                </restricted75HrCities>
                <equipmentTypes>
                    <equipment>
                        <type>777</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
        <crewMember>
            <id>22222</id>
            <firstName>JANE</firstName>
            <middleName>R</middleName>
            <lastName>DOE</lastName>
            <seniorityNumber>8</seniorityNumber>
            <base>DFW</base>
            <division>I</division>
            <seat>CA</seat>
            <flightQualification>
                <hitcities>
                    <hitcity>
                        <code>OAX</code>
                    </hitcity>
                    <hitcity>
                        <code>MSO</code>
                    </hitcity>
                    <hitcity>
                        <code>US</code>
                    </hitcity>
                </hitcities>
                <monthlyMax>7642</monthlyMax>
                <volMax>0</volMax>
                <restricted75HrCities/>
                <equipmentTypes>
                    <equipment>
                        <type>787</type>
                    </equipment>
                </equipmentTypes>
                <green>false</green>
                <trainerLineCheck>false</trainerLineCheck>
                <trainerLineIndoctrination>false</trainerLineIndoctrination>
                <isVaccinated>true</isVaccinated>
            </flightQualification>
            <bidQualification>
                <canBid>true</canBid>
                <canBeAwarded>true</canBeAwarded>
            </bidQualification>
        </crewMember>
   </crewMembers>
</pbsCrew>
'''

import pandas as pd

stylesheet = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:template match="/">
        <members>
            <xsl:apply-templates select="//crewMember"/>
        </members>
    </xsl:template>
    <xsl:template match="//crewMember">
        <crewMember>
            <ID>
                <xsl:value-of select="./id"/>
            </ID>
            <firstname>
                <xsl:value-of select="./firstName"/>
            </firstname>
            <lastname>
                <xsl:value-of select="./lastName"/>
            </lastname>
            <isVaccinated>
                <xsl:value-of select="./flightQualification/isVaccinated"/>
            </isVaccinated>
        </crewMember>
    </xsl:template>
</xsl:stylesheet>'''

df = pd.read_xml(xml, xpath="//crewMember", stylesheet = stylesheet)
print(df)

Here stylesheet transforms original XML to the following form:

<?xml version="1.0" encoding="UTF-8"?>
<members>
   <crewMember>
      <ID>12345</ID>
      <firstname>John</firstname>
      <lastname>Doe</lastname>
      <isVaccinated>true</isVaccinated>
   </crewMember>
   <crewMember>
      <ID>22222</ID>
      <firstname>JANE</firstname>
      <lastname>DOE</lastname>
      <isVaccinated>true</isVaccinated>
   </crewMember>
</members>

This allows us to use xpath //crewMember" taking children of a crewMember element as dataframe row fields.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文