How to parse XML with ElementTree for use in pandas

Posted on 2025-02-08 08:55:10

I have a rather large XML file with multiple different elements, similar to the one below:

<adrmsg:ADRMessage xmlns:adrmsg="http://www.eurocontrol.int/cfmu/b2b/ADRMessage"
    xmlns:gml="http://www.opengis.net/gml/3.2" gml:id="ID_197112_1650420171084_1"
    xmlns:adrext="http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR"
    xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <adrmsg:hasMember>
        <aixm:Airspace gml:id="ID_197112_1650420171084_93332">
            <gml:identifier codeSpace="urn:uuid:">3271922d-6b7a-4953-a6ff-599b17ab785e</gml:identifier>
            <aixm:timeSlice>
                <aixm:AirspaceTimeSlice gml:id="ID_197112_1650420171084_93333">
                    <gml:validTime>
                        <gml:TimePeriod gml:id="ID_197112_1650420171084_93334">
                            <gml:beginPosition>2021-10-07T00:00:00</gml:beginPosition>
                            <gml:endPosition indeterminatePosition="unknown"/>
                        </gml:TimePeriod>
                    </gml:validTime>
                    <aixm:interpretation>BASELINE</aixm:interpretation>
                    <aixm:featureLifetime>
                        <gml:TimePeriod gml:id="ID_197112_1650420171084_93335">
                            <gml:beginPosition>2021-10-07T00:00:00</gml:beginPosition>
                            <gml:endPosition indeterminatePosition="unknown"/>
                        </gml:TimePeriod>
                    </aixm:featureLifetime>
                    <aixm:type>RAS</aixm:type>
                    <aixm:designator>EDGGNFRA</aixm:designator>
                    <aixm:name>EDGG NON FRA</aixm:name>
                    <aixm:designatorICAO>NO</aixm:designatorICAO>
                    <aixm:geometryComponent>
                        <aixm:AirspaceGeometryComponent gml:id="ID_197112_1650420171084_93336">
                            <aixm:operation>BASE</aixm:operation>
                            <aixm:theAirspaceVolume>
                                <aixm:AirspaceVolume gml:id="ID_197112_1650420171084_93337">
                                    <aixm:upperLimit uom="FL">265</aixm:upperLimit>
                                    <aixm:upperLimitReference>STD</aixm:upperLimitReference>
                                    <aixm:lowerLimit uom="FL">245</aixm:lowerLimit>
                                    <aixm:lowerLimitReference>STD</aixm:lowerLimitReference>
                                    <aixm:contributorAirspace>
                                        <aixm:AirspaceVolumeDependency gml:id="ID_197112_1650420171084_93338">
                                            <aixm:dependency>HORZ_PROJECTION</aixm:dependency>
                                            <aixm:theAirspace xlink:href="urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb"/>
                                        </aixm:AirspaceVolumeDependency>
                                    </aixm:contributorAirspace>
                                </aixm:AirspaceVolume>
                            </aixm:theAirspaceVolume>
                        </aixm:AirspaceGeometryComponent>
                    </aixm:geometryComponent>
                    <aixm:geometryComponent>
                        <aixm:AirspaceGeometryComponent gml:id="ID_197112_1650420171084_93339">
                            <aixm:operation>UNION</aixm:operation>
                            <aixm:theAirspaceVolume>
                                <aixm:AirspaceVolume gml:id="ID_197112_1650420171084_93340">
                                    <aixm:upperLimit uom="FL">255</aixm:upperLimit>
                                    <aixm:upperLimitReference>STD</aixm:upperLimitReference>
                                    <aixm:lowerLimit uom="FL">245</aixm:lowerLimit>
                                    <aixm:lowerLimitReference>STD</aixm:lowerLimitReference>
                                    <aixm:contributorAirspace>
                                        <aixm:AirspaceVolumeDependency gml:id="ID_197112_1650420171084_93341">
                                            <aixm:dependency>HORZ_PROJECTION</aixm:dependency>
                                            <aixm:theAirspace xlink:href="urn:uuid:dcd8301c-de12-4e6c-992f-fd8de781ab58"/>
                                        </aixm:AirspaceVolumeDependency>
                                    </aixm:contributorAirspace>
                                </aixm:AirspaceVolume>
                            </aixm:theAirspaceVolume>
                        </aixm:AirspaceGeometryComponent>
                    </aixm:geometryComponent>
                    <aixm:extension>
                        <adrext:AirspaceExtension gml:id="ID_197112_1650420171084_93342">
                            <adrext:usage>OPERATIONAL</adrext:usage>
                        </adrext:AirspaceExtension>
                    </aixm:extension>
                </aixm:AirspaceTimeSlice>
            </aixm:timeSlice>
        </aixm:Airspace>
    </adrmsg:hasMember>
.... many other <adrmsg:hasMember>
</adrmsg:ADRMessage>

I have included only one of those elements, plus the namespaces.

My attempt at the code:

import xml.etree.ElementTree as ET
import pandas as pd

ab = {"adrmsg":"http://www.eurocontrol.int/cfmu/b2b/ADRMessage",
    "gml":"http://www.opengis.net/gml/3.2",
    "adrext":"http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR",
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "xlink":"http://www.w3.org/1999/xlink",
    "id":"http://www.opengis.net/gml/3.2",
    "href":"http://www.w3.org/1999/xlink"
}

root_node = ET.parse('Airspace.xml').getroot()

pipare = []
verate = []
for tag in root_node.findall(".//aixm:Airspace" , ab):
    value = tag.find("gml:identifier", ab)
    for char in tag.findall(".//aixm:AirspaceTimeSlice", ab):
        for per in char.findall(".//aixm:type",ab):
            for ir in char.findall(".//aixm:name",ab):
                for epa in char.findall(".//aixm:designator", ab):
                    for op in char.findall(".//aixm:theAirspace[@xlink:href]", ab):
                        pipare = [value.text, char.attrib,per.text,ir.text,epa.text,op.attrib]
                        verate.append(pipare)
                        
                       
xml_todf = pd.DataFrame(verate, columns=['uuid','id','type','name','designator','contributorAirspace'])

As you can probably see, I am trying in a very 'rough' way to parse that XML, extract the elements I am interested in, and finally put them into a pandas DataFrame.

When I capture the .text, the extracted data is what I want, but when it comes to capturing the attributes, the result contains not only the values but also the namespaces. I don't know how to solve this.
Here is how the pandas DataFrame displays that data:

uuid | id | type | name | designator | contributorAirspace
3271922d-6b7a-4953-a6ff-599b17ab785e | {'{http://www.opengis.net/gml/3.2}id': 'ID_197112_1650420171084_93333'} | RAS | EDGG NON FRA | EDGGNFRA | {'{http://www.w3.org/1999/xlink}href': 'urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb'}
3271922d-6b7a-4953-a6ff-599b17ab785e | {'{http://www.opengis.net/gml/3.2}id': 'ID_197112_1650420171084_93333'} | RAS | EDGG NON FRA | EDGGNFRA | {'{http://www.w3.org/1999/xlink}href': 'urn:uuid:dcd8301c-de12-4e6c-992f-fd8de781ab58'}

I would like to have ideally something like this:

uuid | id | type | name | designator | contributorAirspace
3271922d-6b7a-4953-a6ff-599b17ab785e | ID_197112_1650420171084_93333 | RAS | EDGG NON FRA | EDGGNFRA | 5831b5a2-4861-4bf5-ae99-d31413234cdb, dcd8301c-de12-4e6c-992f-fd8de781ab58

but I would be very grateful if somebody could help me reach this point:

uuid | id | type | name | designator | contributorAirspace
3271922d-6b7a-4953-a6ff-599b17ab785e | ID_197112_1650420171084_93333 | RAS | EDGG NON FRA | EDGGNFRA | 5831b5a2-4861-4bf5-ae99-d31413234cdb
3271922d-6b7a-4953-a6ff-599b17ab785e | ID_197112_1650420171084_93333 | RAS | EDGG NON FRA | EDGGNFRA | dcd8301c-de12-4e6c-992f-fd8de781ab58

Thanks for your help

Comments (1)

早乙女 2025-02-15 08:55:10

Python's ElementTree requires a namespaced attribute to be addressed by its qualified name, i.e. the namespace URI in curly braces followed by the attribute name. char.attrib and op.attrib each return a dictionary containing all of the element's attributes with their values, which is why the namespaces show up in your DataFrame. Here is an example of retrieving a single attribute value:

import xml.etree.ElementTree as ET
import pandas as pd
from collections import defaultdict

ab = {"adrmsg":"http://www.eurocontrol.int/cfmu/b2b/ADRMessage",
    "gml":"http://www.opengis.net/gml/3.2",
    "adrext":"http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR",
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "xlink":"http://www.w3.org/1999/xlink",
    "id":"http://www.opengis.net/gml/3.2",
    "href":"http://www.w3.org/1999/xlink"
}

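# parse XML; `xml` is assumed to hold the document as a string
# (use ET.parse('Airspace.xml').getroot() to read the file from the question instead)
root_node = ET.fromstring(xml)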

# create dictionary to store parsed data
data = defaultdict(list)

for tag in root_node.findall(".//aixm:Airspace" , ab):
    value = tag.find("gml:identifier", ab)
    for char in tag.findall(".//aixm:AirspaceTimeSlice", ab):
        for per in char.findall(".//aixm:type",ab):
            for ir in char.findall(".//aixm:name",ab):
                for epa in char.findall(".//aixm:designator", ab):
                    for op in char.findall(".//aixm:theAirspace[@xlink:href]", ab):
                        data['uuid'].append(value.text)
                        data['id'].append(char.attrib['{http://www.opengis.net/gml/3.2}id'])
                        data['type'].append(per.text)
                        data['name'].append(ir.text)
                        data['designator'].append(epa.text)
                        data['contributorAirspace'].append(op.attrib['{http://www.w3.org/1999/xlink}href'])
                        
                       
df = pd.DataFrame(data)

Note the expressions char.attrib['{http://www.opengis.net/gml/3.2}id'] and op.attrib['{http://www.w3.org/1999/xlink}href']. They address the attributes by their qualified names and return only the attribute values.
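
To make the qualified-name lookup a little less error-prone, the key can be built from the namespace map instead of being typed out by hand. The following is only a minimal sketch: the `qname` helper and the tiny `snippet` document are hypothetical, introduced here for illustration, and are not part of ElementTree or of the original file.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes taken from the question.
NS = {
    "gml": "http://www.opengis.net/gml/3.2",
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "xlink": "http://www.w3.org/1999/xlink",
}


def qname(prefix, local):
    """Hypothetical helper: build the '{namespace-uri}local' key used by ElementTree."""
    return f"{{{NS[prefix]}}}{local}"


# A tiny fragment modelled on the question's document (an assumption for the demo).
snippet = (
    '<aixm:AirspaceTimeSlice xmlns:aixm="http://www.aixm.aero/schema/5.1.1" '
    'xmlns:gml="http://www.opengis.net/gml/3.2" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" gml:id="ID_1">'
    '<aixm:theAirspace xlink:href="urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb"/>'
    "</aixm:AirspaceTimeSlice>"
)
elem = ET.fromstring(snippet)

# Element.get() returns just the attribute value (or None if it is missing).
slice_id = elem.get(qname("gml", "id"))
href = elem.find("aixm:theAirspace", NS).get(qname("xlink", "href"))

print(slice_id)
print(href)
```

Using `Element.get()` rather than indexing `attrib` directly means a missing attribute yields `None` instead of raising `KeyError`, which may be preferable on a large file where not every element carries the attribute.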

Also, this example uses a defaultdict instead of two lists, but that's a matter of taste.
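
For the "ideal" layout from the question (one row per time slice, with all contributor UUIDs joined into one cell), one possible approach is sketched below. It is not the answer's method: it replaces the nested loops with `findtext` for the single-valued fields, and it assumes that `aixm:type`, `aixm:name` and `aixm:designator` are direct children of `aixm:AirspaceTimeSlice`, as in the sample document. The `airspace_rows` function and the shortened `sample` document are hypothetical names made up for this sketch.

```python
import xml.etree.ElementTree as ET

NS = {
    "gml": "http://www.opengis.net/gml/3.2",
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "xlink": "http://www.w3.org/1999/xlink",
}
GML_ID = f"{{{NS['gml']}}}id"
XLINK_HREF = f"{{{NS['xlink']}}}href"


def airspace_rows(root):
    """Return one dict per AirspaceTimeSlice, contributor UUIDs joined by ', '."""
    rows = []
    for airspace in root.iterfind(".//aixm:Airspace", NS):
        uuid = airspace.findtext("gml:identifier", namespaces=NS)
        for ts in airspace.iterfind("aixm:timeSlice/aixm:AirspaceTimeSlice", NS):
            hrefs = []
            for ref in ts.iterfind(".//aixm:theAirspace", NS):
                href = ref.get(XLINK_HREF)
                if href is not None:
                    # Drop the 'urn:uuid:' prefix so only the UUID remains.
                    hrefs.append(href.replace("urn:uuid:", "", 1))
            rows.append({
                "uuid": uuid,
                "id": ts.get(GML_ID),
                "type": ts.findtext("aixm:type", namespaces=NS),
                "name": ts.findtext("aixm:name", namespaces=NS),
                "designator": ts.findtext("aixm:designator", namespaces=NS),
                "contributorAirspace": ", ".join(hrefs),
            })
    return rows


# Shortened stand-in for the question's file (an assumption for the demo).
sample = """<adrmsg:ADRMessage xmlns:adrmsg="http://www.eurocontrol.int/cfmu/b2b/ADRMessage"
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <adrmsg:hasMember>
    <aixm:Airspace gml:id="ID_197112_1650420171084_93332">
      <gml:identifier codeSpace="urn:uuid:">3271922d-6b7a-4953-a6ff-599b17ab785e</gml:identifier>
      <aixm:timeSlice>
        <aixm:AirspaceTimeSlice gml:id="ID_197112_1650420171084_93333">
          <aixm:type>RAS</aixm:type>
          <aixm:designator>EDGGNFRA</aixm:designator>
          <aixm:name>EDGG NON FRA</aixm:name>
          <aixm:geometryComponent>
            <aixm:theAirspace xlink:href="urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb"/>
            <aixm:theAirspace xlink:href="urn:uuid:dcd8301c-de12-4e6c-992f-fd8de781ab58"/>
          </aixm:geometryComponent>
        </aixm:AirspaceTimeSlice>
      </aixm:timeSlice>
    </aixm:Airspace>
  </adrmsg:hasMember>
</adrmsg:ADRMessage>"""

rows = airspace_rows(ET.fromstring(sample))
for row in rows:
    print(row)
```

Passing the result to `pd.DataFrame(rows)` should give a DataFrame with the six columns from the question. If one row per contributor is preferred instead, appending a separate dict for each `href` inside the inner loop, rather than joining them, would give that shape.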
