Parse multiple (n) XML files with different schemas and return multiple (n) dataframes

I have three XML files with a certain tag and different schemas, as shown below.

file:1

<LETADA-LOOK_TYP>
 <COLUMNS>  Long    Short   Value   </COLUMNS>
 <DATA> XC  XC  10670003039 </DATA>
 <DATA> GH  GH  10450003040 </DATA>
 <DATA> HJ  HJ  10220002989 </DATA>
 <DATA> FF  FF  10990002988 </DATA>
 <DATA> DD  DD  10660003041 </DATA>
 <DATA> FE  FE  10660002991 </DATA>
 <DATA> SS  SS  10090003042 </DATA>
 <DATA> LL  LL  10100002990 </DATA>
</LETADA-LOOK_TYP>

file:2
<LETADA-LOOK_TYP>
 <COLUMNS>  Long    Name    Value   </COLUMNS>
 <DATA> LD  ER  10670045039 </DATA>
 <DATA> FR  RT  10450065040 </DATA>
 <DATA> YT  VG  10220090989 </DATA>
 <DATA> QW  TY  10990023988 </DATA>
 <DATA> WE  ER  10660034041 </DATA>
 <DATA> ER  FG  10660045991 </DATA>
 <DATA> ER  ER  10090067042 </DATA>
 <DATA> PO  PO  10100044990 </DATA>
</LETADA-LOOK_TYP>


file:3
<LETADA-LOOK_TYP>
 <COLUMNS> Punt GrubName    Value   </COLUMNS>
 <DATA> GF  ER  10689045039 </DATA>
 <DATA> TY  RT  10434065040 </DATA>
 <DATA> JJ  VG  10212090989 </DATA>
 <DATA> QW  TY  10989023988 </DATA>
 <DATA> TY  ER  10676034041 </DATA>
 <DATA> II  FG  10609045991 </DATA>
 <DATA> OI  ER  10023067042 </DATA>
 <DATA> OW  PO  10145044990 </DATA>
</LETADA-LOOK_TYP>

So I have written a Python script to parse these files, but because the files have different schemas I had to manually write a script for each file, get a dataframe out of it, and then concat all of the resulting dataframes to get the desired output.

Parsing file 1:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as et

data1 = []
tree = et.parse(file1)
root = tree.getroot()
# iter() matches LETADA-LOOK_TYP whether it is the root element itself or nested deeper;
# findall() only searches direct children, so it finds nothing when the tag is the root
for mt in root.iter("LETADA-LOOK_TYP"):
  column = mt.find('COLUMNS').text
  column1 = column.split('\t')
for ms in root.iter("LETADA-LOOK_TYP"):
  datatab = ms.findall("DATA")
  for dat in datatab:
    data = dat.text.split('\t')
    data1.append(data)
dataframe1 = pd.DataFrame(data1, columns=column1)

Parsing file 2 in a different cell:

data2 = []
tree = et.parse(file2)
root = tree.getroot()
for mt in root.iter("LETADA-LOOK_TYP"):
  column = mt.find('COLUMNS').text
  column2 = column.split('\t')
for ms in root.iter("LETADA-LOOK_TYP"):
  datatab = ms.findall("DATA")
  for dat in datatab:
    data = dat.text.split('\t')
    data2.append(data)
dataframe2 = pd.DataFrame(data2, columns=column2)


Parsing file 3 in a different cell:

data3 = []
tree = et.parse(file3)
root = tree.getroot()
for mt in root.iter("LETADA-LOOK_TYP"):
  column = mt.find('COLUMNS').text
  column3 = column.split('\t')
for ms in root.iter("LETADA-LOOK_TYP"):
  datatab = ms.findall("DATA")
  for dat in datatab:
    data = dat.text.split('\t')
    data3.append(data)
dataframe3 = pd.DataFrame(data3, columns=column3)

list_df = [dataframe1, dataframe2, dataframe3]
final_df = pd.concat(list_df).reset_index(drop=True)

Using the above multiple lines of code I can get the desired output, but is there a way to parse multiple files with different schemas, return multiple dataframes, and then concat them to get a final output?
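
One way to avoid repeating the same cell for every file is to wrap the parsing in a function and loop over the paths. The sketch below is only a minimal illustration: it assumes every file keeps the LETADA-LOOK_TYP / COLUMNS / DATA layout with tab-separated text, and (unlike the plain split('\t') calls above) it also strips padding and empty tokens left by leading or trailing tabs; parse_lookup_file and file_paths are made-up names used only for this example.

import pandas as pd
import xml.etree.ElementTree as et

def parse_lookup_file(path):
  # Parse one LETADA-LOOK_TYP file into a dataframe, whatever its column names are.
  tree = et.parse(path)
  root = tree.getroot()
  rows, columns = [], None
  for node in root.iter("LETADA-LOOK_TYP"):
    # keep only non-empty, stripped tokens from the tab-separated header and data rows
    columns = [c.strip() for c in node.find("COLUMNS").text.split('\t') if c.strip()]
    for dat in node.findall("DATA"):
      rows.append([v.strip() for v in dat.text.split('\t') if v.strip()])
  return pd.DataFrame(rows, columns=columns)

# file1, file2, file3 are the same path variables used in the cells above
file_paths = [file1, file2, file3]
dataframes = [parse_lookup_file(p) for p in file_paths]  # one dataframe per file/schema
final_df = pd.concat(dataframes, ignore_index=True)      # stack them into one dataframe

Because pd.concat aligns on column names, rows from a file that lacks a given column simply get NaN there, which matches the result of concatenating the three hand-built dataframes.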
