How to find a specific tag in an XML file and then access its parent tag using Python and minidom
I'm trying to write some code that will search through an XML file of articles for a particular DOI contained within a tag. When it has found the correct DOI I'd like it to then access the <title> and <abstract> text for the article associated with that DOI.
My XML file is in this format:
<root>
  <article>
    <number>
      0
    </number>
    <DOI>
      10.1016/B978-0-12-381015-1.00004-6
    </DOI>
    <title>
      The patagonian toothfish biology, ecology and fishery.
    </title>
    <abstract>
      lots of abstract text
    </abstract>
  </article>
  <article>
    ...All the article tags as shown above...
  </article>
</root>
I'd like the script to find the article with the DOI 10.1016/B978-0-12-381015-1.00004-6 (for example) and then for me to be able to access the <title> and <abstract> tags within the corresponding <article> tag.
So far I've tried to adapt code from this question:
from xml.dom import minidom

datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)

#looking for: 10.1016/B978-0-12-381015-1.00004-6
matchingNodes = [node for node in xmldoc.getElementsByTagName("DOI") if node.firstChild.nodeValue == '10.1016/B978-0-12-381015-1.00004-6']
for i in range(len(matchingNodes)):
    DOI = str(matchingNodes[i])
    print DOI
But I'm not entirely sure what I'm doing!
Thanks for any help.
Is minidom a requirement? It would be quite easy to parse it with lxml and XPath.
This will get you the article with the DOI specified.
Also, it seems that there is whitespace between the tags and their text. I don't know whether this is because of the Stack Overflow formatting or not. If it is really in your file, it is probably why you cannot match the DOI with minidom.
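The snippet that originally accompanied this answer is not preserved here. As a rough stand-in, the sketch below uses the standard library's xml.etree.ElementTree rather than lxml, so it is not the answerer's code. It assumes the file looks like the sample in the question (inlined here as the hypothetical SAMPLE string), strips the surrounding whitespace before comparing, and is written for Python 3. The helper name find_article is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the question's sample, whitespace included.
SAMPLE = """<root>
<article>
<number>
0
</number>
<DOI>
10.1016/B978-0-12-381015-1.00004-6
</DOI>
<title>
The patagonian toothfish biology, ecology and fishery.
</title>
<abstract>
lots of abstract text
</abstract>
</article>
</root>"""


def find_article(root, doi):
    """Return the first <article> whose <DOI> text equals doi, else None."""
    for article in root.iter("article"):
        # findtext returns the raw text, newlines included, so strip it.
        if article.findtext("DOI", default="").strip() == doi:
            return article
    return None


root = ET.fromstring(SAMPLE)
article = find_article(root, "10.1016/B978-0-12-381015-1.00004-6")
if article is not None:
    title = article.findtext("title", default="").strip()
    abstract = article.findtext("abstract", default="").strip()
    print(title)
    print(abstract)
```

For a real file, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE). With lxml installed, an XPath expression using normalize-space() should be able to do the same match in one step.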
IMHO, just look it up in the Python docs!
Try this (not tested):
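The code from this answer is not preserved here either. What follows is only a minimal sketch of one way to do it with minidom, not the answerer's original. It assumes the XML matches the sample in the question (inlined as the hypothetical SAMPLE string) and targets Python 3. The helpers element_text and find_article_by_doi are invented names. The idea is to compare the stripped text of each <DOI> element, then step up to the enclosing <article> through parentNode. If the file really contains the newlines shown in the question, comparing firstChild.nodeValue directly cannot match, because that value includes them.

```python
from xml.dom import minidom

# Hypothetical inline copy of the question's sample, whitespace included.
SAMPLE = """<root>
<article>
<number>
0
</number>
<DOI>
10.1016/B978-0-12-381015-1.00004-6
</DOI>
<title>
The patagonian toothfish biology, ecology and fishery.
</title>
<abstract>
lots of abstract text
</abstract>
</article>
</root>"""


def element_text(element):
    """Join the text nodes directly under element and strip whitespace."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def find_article_by_doi(doc, doi):
    """Return the parent of the first <DOI> whose text equals doi, else None."""
    for doi_node in doc.getElementsByTagName("DOI"):
        if element_text(doi_node) == doi:
            # <DOI> is a direct child of <article>, so its parent is the article.
            return doi_node.parentNode
    return None


doc = minidom.parseString(SAMPLE)
article = find_article_by_doi(doc, "10.1016/B978-0-12-381015-1.00004-6")
if article is not None:
    title = element_text(article.getElementsByTagName("title")[0])
    abstract = element_text(article.getElementsByTagName("abstract")[0])
    print(title)
    print(abstract)
```

To read from disk instead, minidom.parse(path) would replace minidom.parseString(SAMPLE).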