How to find a specific tag in an XML file and then access its parent tag with Python and minidom

Posted on 2024-10-16 13:27:12

I'm trying to write some code that will search through an XML file of articles for a particular DOI contained within a tag. When it has found the correct DOI I'd like it to then access the <title> and <abstract> text for the article associated with that DOI.

My XML file is in this format:

<root>
 <article>
  <number>
   0 
  </number>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6 
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery. 
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
 <article>
  ...All the article tags as shown above...
 </article>
</root>

I'd like the script to find the article with the DOI 10.1016/B978-0-12-381015-1.00004-6 (for example) and then for me to be able to access the <title> and <abstract> tags within the corresponding <article> tag.

So far I've tried to adapt code from this question:

from xml.dom import minidom

datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)   

#looking for: 10.1016/B978-0-12-381015-1.00004-6

matchingNodes = [node for node in xmldoc.getElementsByTagName("DOI") if node.firstChild.nodeValue == '10.1016/B978-0-12-381015-1.00004-6']

for i in range(len(matchingNodes)):
    DOI = str(matchingNodes[i])
    print DOI

But I'm not entirely sure what I'm doing!

Thanks for any help.

萌无敌 2024-10-23 13:27:12

Is minidom a requirement? It would be quite easy to parse it with lxml and XPath.

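from lxml import etree

# Parse the file directly; etree.parse reads the encoding from the document itself.
tree = etree.parse('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
# normalize-space() trims the whitespace around the DOI text before comparing.
articles = tree.xpath('//article[normalize-space(DOI)="10.1016/B978-0-12-381015-1.00004-6"]')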

This will get you the article with the DOI specified.

Also, there seems to be whitespace between the tags and their text. I don't know whether that comes from the Stack Overflow formatting, but it is probably why the exact string comparison fails with minidom.
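
To illustrate the whitespace point while staying with minidom, here is a minimal sketch. It assumes the layout shown in the question and inlines a shortened sample document rather than reading the real file; the helper names `find_article_by_doi` and `child_text` are made up for this example. It strips the DOI text before comparing and then uses `parentNode` to step up from the matching `<DOI>` element to its `<article>`.

```python
from xml.dom import minidom

# Shortened sample in the same layout as the question (inlined for illustration).
SAMPLE = """<root>
 <article>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery.
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
</root>"""


def find_article_by_doi(doc, doi):
    """Return the <article> element whose <DOI> text equals doi, or None."""
    for doi_node in doc.getElementsByTagName("DOI"):
        # firstChild is the text node; its value includes the newlines and indentation.
        text = doi_node.firstChild.nodeValue if doi_node.firstChild else ""
        if text.strip() == doi:
            # The parent of <DOI> is the enclosing <article>.
            return doi_node.parentNode
    return None


def child_text(parent, tag):
    """Stripped text of the first <tag> element below parent."""
    node = parent.getElementsByTagName(tag)[0]
    return node.firstChild.nodeValue.strip()


doc = minidom.parseString(SAMPLE)
article = find_article_by_doi(doc, "10.1016/B978-0-12-381015-1.00004-6")
title = child_text(article, "title")
abstract = child_text(article, "abstract")
print(title)
print(abstract)
```

For the real file, `minidom.parse(path)` would take the place of `minidom.parseString(SAMPLE)`.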

鹿! 2024-10-23 13:27:12

IMHO, just look it up in the Python docs!
Try this (not tested):

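from xml.dom import minidom

# Path to the XML file from the question.
datasource = '/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml'
xmldoc = minidom.parse(datasource)

def get_xmltext(parent, subnode_name):
    # Content of the first matching descendant, with surrounding whitespace trimmed.
    node = parent.getElementsByTagName(subnode_name)[0]
    return "".join(ch.toxml() for ch in node.childNodes).strip()

# Filter on <article> so each match already is the parent of the DOI tag.
matchingNodes = [node for node in xmldoc.getElementsByTagName("article")
                 if get_xmltext(node, "DOI") == '10.1016/B978-0-12-381015-1.00004-6']

for node in matchingNodes:
    print("title:", get_xmltext(node, "title"))
    print("abstract:", get_xmltext(node, "abstract"))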
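
One caveat with `get_xmltext`: it indexes `[0]`, so an article without the requested tag raises `IndexError`, and `toxml()` returns serialized markup, so a title containing `&` comes back as `&amp;`. The following is a hedged variant, not a tested drop-in replacement. The helper name `get_text_or_none`, the inline sample document, and the DOI `10.1000/example-a` are all invented for illustration.

```python
from xml.dom import minidom


def get_text_or_none(parent, tag):
    """Stripped text of the first <tag> below parent, or None if the tag is absent."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes:
        return None
    # Join only text and CDATA children; .data holds the unescaped characters.
    parts = [ch.data for ch in nodes[0].childNodes
             if ch.nodeType in (ch.TEXT_NODE, ch.CDATA_SECTION_NODE)]
    return "".join(parts).strip()


# Hypothetical two-article document: the second article has no <DOI> at all.
doc = minidom.parseString(
    "<root>"
    "<article><DOI> 10.1000/example-a </DOI><title> Fish &amp; chips </title></article>"
    "<article><title> No DOI here </title></article>"
    "</root>"
)

target = "10.1000/example-a"  # hypothetical DOI
matches = [a for a in doc.getElementsByTagName("article")
           if get_text_or_none(a, "DOI") == target]
titles = [get_text_or_none(a, "title") for a in matches]
print(titles)
```

Returning `None` for a missing tag lets the list comprehension skip incomplete articles instead of stopping at the first one.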
