解析大型 XML 并使用 XML 中的数据生成数据帧(使用 python 或其他)的最佳方法是什么?

发布于 2025-01-14 20:38:12 字数 5340 浏览 4 评论 0原文

我尝试根据 XML 文件的信息制作一个表格(或 csv,我使用 pandas 数据框)。

文件位于此处(.zip 为 14 MB,XML 约为 370MB),https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip 。它具有不同语言的包信息 - node.js、python、java 等。又名美国政府组织 NVD 的 CPE 2.3 列表。

前 30 行是这样的:

<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
  <generator>
    <product_name>National Vulnerability Database (NVD)</product_name>
    <product_version>4.9</product_version>
    <schema_version>2.3</schema_version>
    <timestamp>2022-03-17T03:51:01.909Z</timestamp>
  </generator>
  <cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
    <title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
    <references>
      <reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
      <reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
    </references>
    <cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
  </cpe-item>

XML 文件的树结构非常简单,根是“cpe-list”,子元素是“cpe-item”,孙元素是“title” 、“参考文献”和“cpe23-item”。

从“标题”,我想要元素中的文本; 从“cpe23-item”中,我想要属性“name”; 从“引用”中,我想要来自其曾孙“引用”的属性“href”。

数据框应该如下所示:

  | cpe23_name   | title_text            | ref1  | ref2  | ref3  | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN   | NaN
1 | 'cpe23name 2'| 'this is a java pkg'  | 'url1'| 'url2'| NaN   | NaN
... 

我的代码在这里,在〜100秒内完成:

import xml.etree.ElementTree as et

xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()

import time
start_time = time.time()

df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']

title      = '{http://cpe.mitre.org/dictionary/2.0}title'
ref        = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item   = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'

p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []

i = 0

while i < len(xroot):
    for elm in xroot[i]:
        if elm.tag == title:
                p_text = elm.text
                #assign p_text
       
        elif elm.tag == ref:
            for nn in elm:
                s = nn.text.lower()
                #check the lower text in refs
                
                if 'version' in s:
                    p_vers = nn.attrib.get('href')
                    #assign p_vers
                elif 'advisor' in s:
                    p_advi = nn.attrib.get('href')
                    #assign p_advi
                elif 'product' in s:
                    p_prod = nn.attrib.get('href')
                    #assign p_prod
                elif 'vendor' in s:
                    p_vend = nn.attrib.get('href')
                    #assign p_vend
                elif 'change' in s:
                    p_chan = nn.attrib.get('href')
                    #assign p_vend
                else:
                    p_othe = nn.attrib.get('href')
            
        elif elm.tag == cpe_item:
            p_cpe = elm.attrib.get("name")
            #assign p_cpe
           
        else:
            print(elm.tag)
    row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
    rows.append(row)        
    
    p_cpe = None
    p_text = None
    p_vend = None
    p_prod = None
    p_vers = None
    p_chan = None
    p_advi = None
    p_othe = None
    print(len(rows)) #this shows how far I got during the running time
    i+=1
    
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent

print("---853k rows take %s seconds ---" % (time.time() - start_time))
  • 更新:更快的方法是将倒数第二行移出循环。由于“行”已经在每个循环中获取信息,因此无需每次都创建新的数据帧。 现在的运行时间是 136.0491042137146 秒。耶!

I try to make a table (or csv, I'm using pandas dataframe) from the information of an XML file.

The file is here (.zip is 14 MB, XML is ~370MB), https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information of different languages - node.js, python, java etc. aka, CPE 2.3 list by the US government org NVD.

this is how it looks like in the first 30 rows:

<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
  <generator>
    <product_name>National Vulnerability Database (NVD)</product_name>
    <product_version>4.9</product_version>
    <schema_version>2.3</schema_version>
    <timestamp>2022-03-17T03:51:01.909Z</timestamp>
  </generator>
  <cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
    <title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
    <references>
      <reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
      <reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
    </references>
    <cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
  </cpe-item>

The tree structure of the XML file is quite simple, the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.

From 'title', I want the text in the element;
From 'cpe23-item', I want the attribute 'name';
From 'references', I want the attributes 'href' from its great-grandchildren, 'reference'.

The dataframe should look like this:

  | cpe23_name   | title_text            | ref1  | ref2  | ref3  | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN   | NaN
1 | 'cpe23name 2'| 'this is a java pkg'  | 'url1'| 'url2'| NaN   | NaN
... 

my code is here,finished in ~100sec:

import xml.etree.ElementTree as et

xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()

import time
start_time = time.time()

df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']

title      = '{http://cpe.mitre.org/dictionary/2.0}title'
ref        = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item   = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'

p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []

i = 0

while i < len(xroot):
    for elm in xroot[i]:
        if elm.tag == title:
                p_text = elm.text
                #assign p_text
       
        elif elm.tag == ref:
            for nn in elm:
                s = nn.text.lower()
                #check the lower text in refs
                
                if 'version' in s:
                    p_vers = nn.attrib.get('href')
                    #assign p_vers
                elif 'advisor' in s:
                    p_advi = nn.attrib.get('href')
                    #assign p_advi
                elif 'product' in s:
                    p_prod = nn.attrib.get('href')
                    #assign p_prod
                elif 'vendor' in s:
                    p_vend = nn.attrib.get('href')
                    #assign p_vend
                elif 'change' in s:
                    p_chan = nn.attrib.get('href')
                    #assign p_vend
                else:
                    p_othe = nn.attrib.get('href')
            
        elif elm.tag == cpe_item:
            p_cpe = elm.attrib.get("name")
            #assign p_cpe
           
        else:
            print(elm.tag)
    row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
    rows.append(row)        
    
    p_cpe = None
    p_text = None
    p_vend = None
    p_prod = None
    p_vers = None
    p_chan = None
    p_advi = None
    p_othe = None
    print(len(rows)) #this shows how far I got during the running time
    i+=1
    
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent

print("---853k rows take %s seconds ---" % (time.time() - start_time))
  • updated: the faster way is to move the 2nd last row out side the loop. Since 'rows' already get info in each loop, there is no need to make a new dataframe every time.
    the running time now is 136.0491042137146 seconds. yay!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

怪我鬧 2025-01-21 20:38:12

由于您的 XML 相当扁平,请考虑最近添加的 IO 模块, pandas.read_xml 在 v1.3 中引入。给定 XML 使用默认命名空间,要引用 xpath 中的元素,请使用 namespaces 参数:

url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)

如果您没有安装默认解析器 lxml,使用etree解析器:

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)

Since your XML is fairly flat, consider the recently added IO module, pandas.read_xml introduced in v1.3. Given XML uses a default namespace, to reference elements in xpath use namespaces argument:

url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)

If you do not have the default parser, lxml, installed, use the etree parser:

df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文