Parsing XML tags into a pandas DataFrame
I have an XML file as below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pbsCrew>
<schedulePeriod>
<startDate>2022-06-02</startDate>
<endDate>2022-07-01</endDate>
</schedulePeriod>
<crewMembers>
<crewMember>
<id>12345</id>
<firstName>John</firstName>
<middleName>S</middleName>
<lastName>Doe</lastName>
<seniorityNumber>2</seniorityNumber>
<base>DFW</base>
<division>I</division>
<seat>CA</seat>
<flightQualification>
<hitcities/>
<monthlyMax>9983</monthlyMax>
<volMax>0</volMax>
<restricted75HrCities>
<city>
<code>GUC</code>
</city>
<city>
<code>MSO</code>
</city>
</restricted75HrCities>
<equipmentTypes>
<equipment>
<type>777</type>
</equipment>
</equipmentTypes>
<green>false</green>
<isVaccinated>true</isVaccinated>
</flightQualification>
<bidQualification>
<canBid>true</canBid>
<canBeAwarded>true</canBeAwarded>
</bidQualification>
</crewMember>
<crewMember>
<id>22222</id>
<firstName>JANE</firstName>
<middleName>R</middleName>
<lastName>DOE</lastName>
<seniorityNumber>8</seniorityNumber>
<base>DFW</base>
<division>I</division>
<seat>CA</seat>
<flightQualification>
<hitcities>
<hitcity>
<code>OAX</code>
</hitcity>
<hitcity>
<code>MSO</code>
</hitcity>
<hitcity>
<code>US</code>
</hitcity>
</hitcities>
<monthlyMax>7642</monthlyMax>
<volMax>0</volMax>
<restricted75HrCities/>
<equipmentTypes>
<equipment>
<type>787</type>
</equipment>
</equipmentTypes>
<green>false</green>
<trainerLineCheck>false</trainerLineCheck>
<trainerLineIndoctrination>false</trainerLineIndoctrination>
<isVaccinated>true</isVaccinated>
</flightQualification>
<bidQualification>
<canBid>true</canBid>
<canBeAwarded>true</canBeAwarded>
</bidQualification>
</crewMember>
</crewMembers>
</pbsCrew>
I would like to parse each crewMember tag and get the id, firstName, and lastName from it, plus the isVaccinated tag from flightQualification, all into one pandas DataFrame like the one below:
ID | firstname | lastname | isVaccinated |
---|---|---|---|
12345 | John | Doe | True |
22222 | Jane | Doe | True |
I know this will involve a for loop to retrieve the firstName and lastName values, but I am having trouble extracting all the values I need. Is there a quick way to do this?
Comments (1)
Although this can be done with a loop using the ElementTree or lxml library, the same goal can be achieved with the pandas read_xml() method and its stylesheet argument. This argument specifies an XSL stylesheet used to transform the original XML. In your case, the stylesheet is needed to flatten the document so that all elements of interest are contained within one parent element that serves as a row.
Here is the example:
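The original code listing did not survive on this page, so what follows is only a minimal sketch of the approach described, not the answerer's exact code. It assumes pandas 1.3 or later with lxml installed (read_xml only supports stylesheets with the lxml parser). The XML is an abridged copy of the question's file, and the stylesheet itself is an assumption written for illustration.

```python
import io

import pandas as pd  # read_xml with a stylesheet also requires lxml

# Abridged copy of the question's XML (fewer fields per crew member).
xml_text = """<pbsCrew>
  <crewMembers>
    <crewMember>
      <id>12345</id>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
      <flightQualification>
        <monthlyMax>9983</monthlyMax>
        <isVaccinated>true</isVaccinated>
      </flightQualification>
    </crewMember>
    <crewMember>
      <id>22222</id>
      <firstName>JANE</firstName>
      <lastName>DOE</lastName>
      <flightQualification>
        <monthlyMax>7642</monthlyMax>
        <isVaccinated>true</isVaccinated>
      </flightQualification>
    </crewMember>
  </crewMembers>
</pbsCrew>"""

# Assumed stylesheet (not from the original answer): for every crewMember,
# copy id, firstName, lastName and the nested isVaccinated element so that
# they all become direct children of one flat crewMember element.
xsl_text = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <crewMembers>
      <xsl:for-each select="//crewMember">
        <crewMember>
          <xsl:copy-of select="id | firstName | lastName | flightQualification/isVaccinated"/>
        </crewMember>
      </xsl:for-each>
    </crewMembers>
  </xsl:template>
</xsl:stylesheet>"""

# The xpath is evaluated against the transformed (flattened) document,
# so each flat crewMember element becomes one DataFrame row.
df = pd.read_xml(
    io.StringIO(xml_text),
    xpath="//crewMember",
    stylesheet=io.StringIO(xsl_text),
    parser="lxml",
)
print(df)
```

Both the XML and the stylesheet are wrapped in io.StringIO because newer pandas versions no longer accept literal XML strings; with files on disk, the two paths could be passed instead.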
Here the stylesheet transforms the original XML into a flattened form in which each crewMember element directly contains the elements of interest. This allows us to use the xpath "//crewMember", taking the children of each crewMember element as DataFrame row fields.
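For comparison, the loop-based route mentioned at the start of this answer could look roughly like the sketch below. It uses only the standard library's ElementTree plus pandas. The sample XML is an abridged copy of the question's file, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Abridged copy of the question's XML (fewer fields per crew member).
xml_text = """<pbsCrew>
  <crewMembers>
    <crewMember>
      <id>12345</id>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
      <flightQualification>
        <isVaccinated>true</isVaccinated>
      </flightQualification>
    </crewMember>
    <crewMember>
      <id>22222</id>
      <firstName>JANE</firstName>
      <lastName>DOE</lastName>
      <flightQualification>
        <isVaccinated>true</isVaccinated>
      </flightQualification>
    </crewMember>
  </crewMembers>
</pbsCrew>"""

root = ET.fromstring(xml_text)

rows = []
for member in root.iter("crewMember"):
    rows.append(
        {
            "id": int(member.findtext("id")),
            "firstName": member.findtext("firstName"),
            "lastName": member.findtext("lastName"),
            # isVaccinated is nested one level down, so a relative path is used;
            # the text "true"/"false" is converted to a Python bool.
            "isVaccinated": member.findtext("flightQualification/isVaccinated") == "true",
        }
    )

df = pd.DataFrame(rows)
print(df)
```

This version needs no stylesheet and no lxml, at the cost of naming each field by hand inside the loop.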