解析大型 XML 并使用 XML 中的数据生成数据帧（使用 python 或其他）的最佳方法是什么？

发布于 2025-01-14 20:38:12 字数 5340 浏览 4 评论 0原文

我尝试根据 XML 文件的信息制作一个表格（或 csv，我使用 pandas 数据框）。

文件位于此处（.zip 为 14 MB，XML 约为 370MB），https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip 。它具有不同语言的包信息 - node.js、python、java 等。又名美国政府组织 NVD 的 CPE 2.3 列表。

前 30 行是这样的：

<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
  <generator>
    <product_name>National Vulnerability Database (NVD)</product_name>
    <product_version>4.9</product_version>
    <schema_version>2.3</schema_version>
    <timestamp>2022-03-17T03:51:01.909Z</timestamp>
  </generator>
  <cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
    <title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
    <references>
      <reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
      <reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
    </references>
    <cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
  </cpe-item>

XML 文件的树结构非常简单，根是“cpe-list”，子元素是“cpe-item”，孙元素是“title” 、“参考文献”和“cpe23-item”。

从“标题”，我想要元素中的文本；从“cpe23-item”中，我想要属性“name”；从“引用”中，我想要来自其曾孙“引用”的属性“href”。

数据框应该如下所示：

  | cpe23_name   | title_text            | ref1  | ref2  | ref3  | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN   | NaN
1 | 'cpe23name 2'| 'this is a java pkg'  | 'url1'| 'url2'| NaN   | NaN
...

我的代码在这里，在〜100秒内完成：

import xml.etree.ElementTree as et

xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()

import time
start_time = time.time()

df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']

title      = '{http://cpe.mitre.org/dictionary/2.0}title'
ref        = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item   = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'

p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []

i = 0

while i < len(xroot):
    for elm in xroot[i]:
        if elm.tag == title:
                p_text = elm.text
                #assign p_text
       
        elif elm.tag == ref:
            for nn in elm:
                s = nn.text.lower()
                #check the lower text in refs
                
                if 'version' in s:
                    p_vers = nn.attrib.get('href')
                    #assign p_vers
                elif 'advisor' in s:
                    p_advi = nn.attrib.get('href')
                    #assign p_advi
                elif 'product' in s:
                    p_prod = nn.attrib.get('href')
                    #assign p_prod
                elif 'vendor' in s:
                    p_vend = nn.attrib.get('href')
                    #assign p_vend
                elif 'change' in s:
                    p_chan = nn.attrib.get('href')
                    #assign p_vend
                else:
                    p_othe = nn.attrib.get('href')
            
        elif elm.tag == cpe_item:
            p_cpe = elm.attrib.get("name")
            #assign p_cpe
           
        else:
            print(elm.tag)
    row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
    rows.append(row)        
    
    p_cpe = None
    p_text = None
    p_vend = None
    p_prod = None
    p_vers = None
    p_chan = None
    p_advi = None
    p_othe = None
    print(len(rows)) #this shows how far I got during the running time
    i+=1
    
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent

print("---853k rows take %s seconds ---" % (time.time() - start_time))

更新：更快的方法是将倒数第二行移出循环。由于“行”已经在每个循环中获取信息，因此无需每次都创建新的数据帧。现在的运行时间是 136.0491042137146 秒。耶！

原文

I try to make a table (or csv, I'm using pandas dataframe) from the information of an XML file.

The file is here (.zip is 14 MB, XML is ~370MB), https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information of different languages - node.js, python, java etc. aka, CPE 2.3 list by the US government org NVD.

this is how it looks like in the first 30 rows:

<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
  <generator>
    <product_name>National Vulnerability Database (NVD)</product_name>
    <product_version>4.9</product_version>
    <schema_version>2.3</schema_version>
    <timestamp>2022-03-17T03:51:01.909Z</timestamp>
  </generator>
  <cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
    <title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
    <references>
      <reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
      <reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
    </references>
    <cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
  </cpe-item>

The tree structure of the XML file is quite simple, the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.

From 'title', I want the text in the element;
From 'cpe23-item', I want the attribute 'name';
From 'references', I want the attributes 'href' from its great-grandchildren, 'reference'.

The dataframe should look like this:

  | cpe23_name   | title_text            | ref1  | ref2  | ref3  | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN   | NaN
1 | 'cpe23name 2'| 'this is a java pkg'  | 'url1'| 'url2'| NaN   | NaN
...

my code is here，finished in ~100sec：

import xml.etree.ElementTree as et

xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()

import time
start_time = time.time()

df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']

title      = '{http://cpe.mitre.org/dictionary/2.0}title'
ref        = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item   = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'

p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []

i = 0

while i < len(xroot):
    for elm in xroot[i]:
        if elm.tag == title:
                p_text = elm.text
                #assign p_text
       
        elif elm.tag == ref:
            for nn in elm:
                s = nn.text.lower()
                #check the lower text in refs
                
                if 'version' in s:
                    p_vers = nn.attrib.get('href')
                    #assign p_vers
                elif 'advisor' in s:
                    p_advi = nn.attrib.get('href')
                    #assign p_advi
                elif 'product' in s:
                    p_prod = nn.attrib.get('href')
                    #assign p_prod
                elif 'vendor' in s:
                    p_vend = nn.attrib.get('href')
                    #assign p_vend
                elif 'change' in s:
                    p_chan = nn.attrib.get('href')
                    #assign p_vend
                else:
                    p_othe = nn.attrib.get('href')
            
        elif elm.tag == cpe_item:
            p_cpe = elm.attrib.get("name")
            #assign p_cpe
           
        else:
            print(elm.tag)
    row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
    rows.append(row)        
    
    p_cpe = None
    p_text = None
    p_vend = None
    p_prod = None
    p_vers = None
    p_chan = None
    p_advi = None
    p_othe = None
    print(len(rows)) #this shows how far I got during the running time
    i+=1
    
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent

print("---853k rows take %s seconds ---" % (time.time() - start_time))

updated: the faster way is to move the 2nd last row out side the loop. Since 'rows' already get info in each loop, there is no need to make a new dataframe every time.
the running time now is 136.0491042137146 seconds. yay!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

怪我鬧 2025-01-21 20:38:12

由于您的 XML 相当扁平，请考虑最近添加的 IO 模块， pandas.read_xml 在 v1.3 中引入。给定 XML 使用默认命名空间，要引用 xpath 中的元素，请使用 namespaces 参数：

url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)

如果您没有安装默认解析器 lxml，使用etree解析器：

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)

Since your XML is fairly flat, consider the recently added IO module, pandas.read_xml introduced in v1.3. Given XML uses a default namespace, to reference elements in xpath use namespaces argument:

url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)

If you do not have the default parser, lxml, installed, use the etree parser:

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)

回复收藏 0 原文

~没有更多了~

关于作者

短暂陪伴

暂无简介

文章

25 人气

关注发私信

友情链接

文江博客

解析大型 XML 并使用 XML 中的数据生成数据帧（使用 python 或其他）的最佳方法是什么？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（1）

关于作者

相关话题

热门标签

推荐作者

櫻之舞

弥枳

m2429

寻找一个思念的角度

野却迷人

我怀念的。

友情链接

解析大型 XML 并使用 XML 中的数据生成数据帧（使用 python 或其他）的最佳方法是什么？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（1）

关于作者

相关话题

热门标签

推荐作者

櫻之舞

弥枳

m2429

寻找一个思念的角度

野却迷人

我怀念的。

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。