Python Beautiful Soup XML parsing with conditional queries (mixed tags and attributes)
I have two versions of XML files that I need to extract content from. Both have the same information in two different formats as follows (not just different tags, but different structure):
- The first has active and inactive elements defined between:
<activeelementSubstance></activeelementSubstance>
and <inactiveelementSubstance></inactiveelementSubstance>
- The second has active and inactive elements defined between:
<element classCode="IACT"></element>
and <element classCode="ACTIM"></element>
How do I handle both situations to extract the inactive and active elements? (There are several examples where tags have synonyms, but I have not seen any where the actual structure differs as it does here.)
I came up with the following code (not very clean) to extract from the second case (extracting the element and its code):
# Assumes `soup` was built with the "xml" parser, which keeps tag and attribute case.
activeElements = soup.find_all('element', attrs={'classCode': 'ACTIM'})
for i in activeElements:
    aiName = i.find('name')
    aiCode = i.find('code')['code']  # read the attribute directly instead of regex-matching the tag string
    print(aiName.text)
    print(aiCode)
print('\nInactive Elements\n')
inactiveElements = soup.find_all('element', attrs={'classCode': 'IACT'})
for i in inactiveElements:
    aiName = i.find('name')
    aiCode = i.find('code')['code']
    print(aiName.text)
    print(aiCode)
Examples of XML files are as follows:
First example, using the <element classCode="IACT"></element> and <element classCode="ACTIM"></element> format (the second case described above):
<?xml version="1.0" encoding="UTF-8"?>
<document>
<manufacturedProduct>
<element classCode="IACT">
<elementSubstance>
<code code="36SFW2JZ" codeSystem="33590coding"/>
<name>HYPROMELLOSE 2910 (15 MPA.S)</name>
</elementSubstance>
</element>
<element classCode="IACT">
<elementSubstance>
<code code="70097M6I" codeSystem="33590coding"/>
<name>MAGNESIUM STEARATE</name>
</elementSubstance>
</element>
<element classCode="IACT">
<elementSubstance>
<code code="XHX3C3X6" codeSystem="33590coding"/>
<name>TRIACETIN</name>
</elementSubstance>
</element>
<element classCode="ACTIM">
<quantity>
<numerator unit="mg" value="250"/>
<denominator unit="1" value="1"/>
</quantity>
<elementSubstance>
<code code="JTE4MNN1" codeSystem="33590coding"/>
<name>AZITHROMYCIN MONOHYDRATE</name>
</elementSubstance>
</element>
</manufacturedProduct>
</document>
Second example, using the <activeelementSubstance></activeelementSubstance> and <inactiveelementSubstance></inactiveelementSubstance> format (the first case described above):
<?xml version="1.0" encoding="UTF-8"?>
<document>
<manufacturedProduct>
<activeelement>
<activeelementSubstance>
<code code="VB0R961H" codeSystem="33590coding" codeSystemName="USDA" />
<name>Prednisone</name>
</activeelementSubstance>
</activeelement>
<inactiveelement>
<inactiveelementSubstance>
<code code="776XM704" codeSystem="33590coding" codeSystemName="USDA" />
<name>calcium stearate</name>
</inactiveelementSubstance>
</inactiveelement>
<inactiveelement>
<inactiveelementSubstance>
<name>corn starch</name>
</inactiveelementSubstance>
</inactiveelement>
</manufacturedProduct>
</document>
The XML files are deeply nested (which is partly why I am using Beautiful Soup), so I tried to clean them up and extract only the relevant portion.
Comments (2)
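You can use CSS selectors with a comma (","), which combines several selectors into one list and matches any element that satisfies at least one of them, so a single query covers both formats. The results can be collected together or split between active and inactive.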
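The code that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of the selector-list idea, not the answerer's exact code. The sample document, the `extract` helper, and the `ACTIVE`/`INACTIVE` names are hypothetical. It assumes the built-in "html.parser" backend, which lowercases tag and attribute names, so the selectors are written in lowercase; with the "xml" parser (which needs lxml) the original case would be kept, e.g. element[classCode="ACTIM"].

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mixes both formats from the question.
SAMPLE = """
<document>
  <manufacturedProduct>
    <element classCode="IACT">
      <elementSubstance>
        <code code="70097M6I" codeSystem="33590coding"/>
        <name>MAGNESIUM STEARATE</name>
      </elementSubstance>
    </element>
    <element classCode="ACTIM">
      <elementSubstance>
        <code code="JTE4MNN1" codeSystem="33590coding"/>
        <name>AZITHROMYCIN MONOHYDRATE</name>
      </elementSubstance>
    </element>
    <activeelement>
      <activeelementSubstance>
        <code code="VB0R961H" codeSystem="33590coding"/>
        <name>Prednisone</name>
      </activeelementSubstance>
    </activeelement>
    <inactiveelement>
      <inactiveelementSubstance>
        <name>corn starch</name>
      </inactiveelementSubstance>
    </inactiveelement>
  </manufacturedProduct>
</document>
"""

# Assumption: html.parser lowercases names, hence the lowercase selectors.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Each selector list covers one category in both formats; "," means OR.
ACTIVE = 'element[classcode="ACTIM"] elementsubstance, activeelementsubstance'
INACTIVE = 'element[classcode="IACT"] elementsubstance, inactiveelementsubstance'


def extract(selector):
    """Return (name, code) pairs for every substance matched by the selector."""
    pairs = []
    for substance in soup.select(selector):
        name = substance.find("name")
        code = substance.find("code")
        pairs.append((
            name.get_text(strip=True) if name else None,
            code.get("code") if code else None,  # some entries have no <code>
        ))
    return pairs


active = extract(ACTIVE)
inactive = extract(INACTIVE)
print("Active:", active)
print("Inactive:", inactive)
```

Keeping one selector list per category means no branching is needed after the query, at the cost of running two queries over the document.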
You could use CSS OR syntax within a CSS selector list, then add conditional logic to determine where to assign the extracted results. The code could do with a refactor to reduce nesting, e.g. by moving certain parts into their own functions, but it gives a starting point.
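The answer's original code is not preserved here; what follows is a hedged sketch of the approach it describes: one selector list that matches every substance in either format, followed by conditional logic that decides which bucket each result belongs to. The sample document, `SELECTOR`, `is_active`, and `results` are hypothetical names, and the built-in "html.parser" backend is assumed, which lowercases tag and attribute names.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mixes both formats from the question.
SAMPLE = """
<document>
  <manufacturedProduct>
    <element classCode="IACT">
      <elementSubstance>
        <code code="XHX3C3X6" codeSystem="33590coding"/>
        <name>TRIACETIN</name>
      </elementSubstance>
    </element>
    <element classCode="ACTIM">
      <elementSubstance>
        <code code="JTE4MNN1" codeSystem="33590coding"/>
        <name>AZITHROMYCIN MONOHYDRATE</name>
      </elementSubstance>
    </element>
    <activeelement>
      <activeelementSubstance>
        <code code="VB0R961H" codeSystem="33590coding"/>
        <name>Prednisone</name>
      </activeelementSubstance>
    </activeelement>
    <inactiveelement>
      <inactiveelementSubstance>
        <code code="776XM704" codeSystem="33590coding"/>
        <name>calcium stearate</name>
      </inactiveelementSubstance>
    </inactiveelement>
  </manufacturedProduct>
</document>
"""

# Assumption: html.parser lowercases names, hence the lowercase selectors.
soup = BeautifulSoup(SAMPLE, "html.parser")

# One selector list (CSS OR) that matches substances in both formats.
SELECTOR = (
    'element[classcode="ACTIM"] elementsubstance, '
    'element[classcode="IACT"] elementsubstance, '
    "activeelementsubstance, "
    "inactiveelementsubstance"
)


def is_active(substance):
    """Classify a matched substance by its tag name or its parent's classCode."""
    if substance.name == "activeelementsubstance":
        return True
    if substance.name == "inactiveelementsubstance":
        return False
    parent = substance.find_parent("element")
    return parent is not None and parent.get("classcode") == "ACTIM"


results = {"active": [], "inactive": []}
for substance in soup.select(SELECTOR):
    name = substance.find("name")
    code = substance.find("code")
    record = {
        "name": name.get_text(strip=True) if name else None,
        "code": code.get("code") if code else None,
    }
    # Conditional logic decides where the extracted record is assigned.
    results["active" if is_active(substance) else "inactive"].append(record)

print(results)
```

Moving the classification into `is_active` keeps the loop flat, which is one way to apply the refactor the answer suggests.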