Parsing a large XML file in Python

Posted 2025-02-06 21:39:25

I have a very large XML file (about 100 MB) with many elements similar to the one in this example:

<adrmsg:hasMember>
    <aixm:DesignatedPoint gml:id="ID_197095_1650420151927_74256">
        <gml:identifier codeSpace="urn:uuid:">084e1bb6-94f7-450f-a88e-44eb465cd5a6</gml:identifier>
        <aixm:timeSlice>
            <aixm:DesignatedPointTimeSlice gml:id="ID_197095_1650420151927_74257">
                <gml:validTime>
                    <gml:TimePeriod gml:id="ID_197095_1650420151927_74258">
                        <gml:beginPosition>2020-12-31T00:00:00</gml:beginPosition>
                        <gml:endPosition indeterminatePosition="unknown"/>
                    </gml:TimePeriod>
                </gml:validTime>
                <aixm:interpretation>BASELINE</aixm:interpretation>
                <aixm:featureLifetime>
                    <gml:TimePeriod gml:id="ID_197095_1650420151927_74259">
                        <gml:beginPosition>2020-12-31T00:00:00</gml:beginPosition>
                        <gml:endPosition indeterminatePosition="unknown"/>
                    </gml:TimePeriod>
                </aixm:featureLifetime>
                <aixm:designator>BITLA</aixm:designator>
                <aixm:type>ICAO</aixm:type>
                <aixm:location>
                    <aixm:Point gml:id="ID_197095_1650420151927_74260">
                        <gml:pos srsName="urn:ogc:def:crs:EPSG::4326">40.87555555555556 21.358055555555556</gml:pos>
                    </aixm:Point>
                </aixm:location>
                <aixm:extension>
                    <adrext:DesignatedPointExtension gml:id="ID_197095_1650420151927_74261">
                        <adrext:pointUsage>
                            <adrext:PointUsage gml:id="ID_197095_1650420151927_74262">
                                <adrext:role>FRA_ENTRY</adrext:role>
                                <adrext:reference_border>
                                    <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74263">
                                        <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                                        <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>
                                    </adrext:AirspaceBorderCrossingObject>
                                </adrext:reference_border>
                            </adrext:PointUsage>
                        </adrext:pointUsage>
                        <adrext:pointUsage>
                            <adrext:PointUsage gml:id="ID_197095_1650420151927_74264">
                                <adrext:role>FRA_EXIT</adrext:role>
                                <adrext:reference_border>
                                    <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74265">
                                        <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                                        <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>
                                    </adrext:AirspaceBorderCrossingObject>
                                </adrext:reference_border>
                            </adrext:PointUsage>
                        </adrext:pointUsage>
                    </adrext:DesignatedPointExtension>
                </aixm:extension>
            </aixm:DesignatedPointTimeSlice>
        </aixm:timeSlice>
    </aixm:DesignatedPoint>
</adrmsg:hasMember>

The ultimate goal is to load the parsed data from this very large XML file into a pandas DataFrame.

So far I cannot capture the data I am looking for.
I only manage to capture the data from the very last element in that large XML file.

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

ab = {'aixm':'http://www.aixm.aero/schema/5.1.1', 'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR', 'gml':'http://www.opengis.net/gml/3.2'}
for point in root.findall('.//aixm:DesignatedPointTimeSlice', ab):
    designator = point.find('.//aixm:designator', ab)
    d = point.find('.//{http://www.aixm.aero/schema/5.1.1}type', ab)
for pos in point.findall('.//gml:pos', ab):
    print(designator.text, pos.text, d.text)

The print statement returns the data I want but, as mentioned, only for the very last element of the file, whereas I would like the result for all of them:

ZIFSA 54.02111111111111 27.823888888888888 ICAO

Could someone please advise on the path I should follow? Any help is appreciated.
Thank you very much.

我还不会笑 2025-02-13 21:39:25

Assuming all three needed nodes (aixm:designator, aixm:type, and gml:pos) are always present, consider parsing the two parent nodes, aixm:DesignatedPointTimeSlice and aixm:Point, into separate DataFrames and then joining them on their index. Finally, select the three columns needed.

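import pandas as pd

# Namespace prefixes used in the XPath expressions below
ab = {
    'aixm':'http://www.aixm.aero/schema/5.1.1',
    'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR',
    'gml':'http://www.opengis.net/gml/3.2'
}

# One row per time slice; its child elements (designator, type, ...) become columns
time_slice_df = pd.read_xml(
    'file.xml', xpath=".//aixm:DesignatedPointTimeSlice", namespaces=ab
).add_prefix("time_slice_")

# One row per point; its gml:pos child becomes the "pos" column
point_df = pd.read_xml(
    'file.xml', xpath=".//aixm:Point", namespaces=ab
).add_prefix("point_")

# Join on the default integer index, then keep only the three needed columns
time_slice_df = (
    time_slice_df.join(point_df)
    .reindex(
        ["time_slice_designator", "time_slice_type", "point_pos"],
        axis="columns"
    )
)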
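
The join above pairs rows purely by position, so it relies on every aixm:DesignatedPointTimeSlice containing exactly one aixm:Point. If that cannot be guaranteed, one option is to collect all three values inside a single loop with the standard library. This is also where the code in the question goes wrong: its second for loop sits outside the first, so it only ever sees the last point. The following is a minimal sketch, not a definitive implementation; the function name extract_rows and the simplified inline sample (a plain root wrapper around two time slices) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "gml": "http://www.opengis.net/gml/3.2",
}

def extract_rows(root):
    """Return one dict per gml:pos, paired with the designator and type
    of the time slice that contains it."""
    rows = []
    for ts in root.findall(".//aixm:DesignatedPointTimeSlice", NS):
        # findtext returns None when the child is missing instead of raising
        designator = ts.findtext("aixm:designator", namespaces=NS)
        point_type = ts.findtext("aixm:type", namespaces=NS)
        # The inner loop is nested, so every time slice is handled, not just the last
        for pos in ts.findall(".//gml:pos", NS):
            rows.append(
                {"designator": designator, "type": point_type, "pos": pos.text}
            )
    return rows

# Simplified sample (assumption): only the elements the extraction needs
SAMPLE = """
<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>BITLA</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location><aixm:Point>
      <gml:pos>40.87555555555556 21.358055555555556</gml:pos>
    </aixm:Point></aixm:location>
  </aixm:DesignatedPointTimeSlice>
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>ZIFSA</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location><aixm:Point>
      <gml:pos>54.02111111111111 27.823888888888888</gml:pos>
    </aixm:Point></aixm:location>
  </aixm:DesignatedPointTimeSlice>
</root>
"""

rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row["designator"], row["pos"], row["type"])
```

With the real file, something like pd.DataFrame(extract_rows(ET.parse('file.xml').getroot())) should then give the three columns, assuming pandas is installed.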

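In pandas 1.5 and later, read_xml also supports iterparse, which streams a large file from disk and can retrieve descendant nodes at any depth rather than only what a single XPath expression exposes. As far as I know, the released implementation matches the repeating element and its descendants by local name, so they are given without namespace prefixes:

time_slice_df = pd.read_xml(
    'file.xml',
    iterparse = {"DesignatedPointTimeSlice":
        ["designator", "type", "pos"]
    }
)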
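
If a pandas version with iterparse support is not available, roughly the same streaming idea can be sketched with xml.etree.ElementTree.iterparse from the standard library: handle each DesignatedPointTimeSlice as soon as its end tag is seen, then clear it so the 100 MB document never has to be held in memory with all its content. This is a sketch under stated assumptions, not the answer's method; the helper name iter_points and the small in-memory sample are hypothetical, and with the real data you would pass 'file.xml' instead of the BytesIO object.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URIs in ElementTree's "{uri}localname" tag form
AIXM = "{http://www.aixm.aero/schema/5.1.1}"
GML = "{http://www.opengis.net/gml/3.2}"

def iter_points(source):
    """Yield (designator, type, pos) tuples while streaming through the XML."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != AIXM + "DesignatedPointTimeSlice":
            continue
        # At the "end" event all children of the time slice are fully parsed
        designator = elem.findtext(AIXM + "designator")
        point_type = elem.findtext(AIXM + "type")
        for pos in elem.iter(GML + "pos"):
            yield designator, point_type, pos.text
        # Release the children of the time slice that was just processed
        elem.clear()

# Simplified sample (assumption): only the elements the extraction needs
SAMPLE = b"""<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>BITLA</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location><aixm:Point>
      <gml:pos>40.87555555555556 21.358055555555556</gml:pos>
    </aixm:Point></aixm:location>
  </aixm:DesignatedPointTimeSlice>
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>ZIFSA</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location><aixm:Point>
      <gml:pos>54.02111111111111 27.823888888888888</gml:pos>
    </aixm:Point></aixm:location>
  </aixm:DesignatedPointTimeSlice>
</root>"""

points = list(iter_points(io.BytesIO(SAMPLE)))
```

The resulting tuples can be passed to pd.DataFrame(points, columns=["designator", "type", "pos"]) if a DataFrame is the end goal.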
