Parsing XML with namespaces in Python via 'ElementTree'
I have the following XML, which I want to parse using Python's ElementTree:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
Because of the namespace, I am getting the following error:
SyntaxError: prefix 'owl' not found in prefix map
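I have tried to fix this, but I still cannot get it to work, since the above XML has multiple nested namespaces.
Kindly let me know how to change the code to find all the owl:Class tags.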
Comments (8)
You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can of course use the same syntax yourself.
Also see the Parsing XML with Namespaces section of the ElementTree documentation.
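A minimal sketch of both forms, assuming a trimmed copy of the XML from the question is held in a string (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, kept inline so the sketch is self-contained.
XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>
"""

root = ET.fromstring(XML)

# Prefix -> namespace URI. The prefixes only have to match the ones used
# in the search paths below, not the ones used in the document.
namespaces = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

labels = [
    label.text
    for owl_class in root.findall("owl:Class", namespaces)
    for label in owl_class.findall("rdfs:label", namespaces)
]

# The same search with the namespace URI written out in {uri}tag form,
# which needs no dictionary at all.
same_classes = root.findall("{http://www.w3.org/2002/07/owl#}Class")

print(labels)
```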
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in the .nsmap attribute on elements and generally has superior namespace support.
Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
UPDATE:
5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.
Here's another case and how I handled it:
An xmlns declaration without a prefix means that unprefixed tags get this default namespace. So when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So I created a new namespace dictionary in which that entry gets a prefix of my own choosing.
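The remapping could look roughly like the sketch below. To keep it runnable with only the standard library, the nsmap dictionary is written by hand as a stand-in for what lxml reports on the root element; the Tag1 wrapper, the myprefix name and the example.com URI are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using a default (unprefixed) namespace.
XML = '<Tag1 xmlns="http://example.com/default"><Tag2>value</Tag2></Tag1>'
root = ET.fromstring(XML)

# Hand-written stand-in for the dictionary lxml exposes as root.nsmap,
# where the default namespace is stored under the key None.
nsmap = {None: "http://example.com/default"}

# Copy the map, giving the None entry a prefix of our own choosing.
namespaces = {
    ("myprefix" if prefix is None else prefix): uri
    for prefix, uri in nsmap.items()
}

tag2 = root.find("myprefix:Tag2", namespaces)
print(tag2.text)
```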
Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.
To extract the namespace prefixes and URIs from XML data you can use the ElementTree.iterparse function, parsing only namespace start events (start-ns). The resulting dictionary can then be passed as an argument to the search functions.
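A short sketch of this approach, assuming the XML is available as a string (a file name or an open file can be passed to iterparse instead):

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML.
XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>
"""

# Each "start-ns" event yields a (prefix, uri) tuple, so the events can be
# collected straight into a dictionary.
namespaces = dict(
    node for _event, node in ET.iterparse(io.StringIO(XML), events=["start-ns"])
)

root = ET.fromstring(XML)
classes = root.findall("owl:Class", namespaces)
print(namespaces)
```

Note that a default namespace, if the document declares one, ends up under the empty-string key.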
You can get the namespace in its namespace format, e.g. {myNameSpace}, and then use it later on in your code to find nodes, e.g. using string interpolation (Python 3).
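One way to do this, sketched under the assumption that the namespace of interest is the one on the root element, is to cut the {…} part out of root.tag with a regular expression. The Atom-style sample document is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document whose elements all live in one default namespace.
XML = '<feed xmlns="http://www.w3.org/2005/Atom"><link href="http://example.com/"/></feed>'
root = ET.fromstring(XML)

# root.tag looks like "{http://www.w3.org/2005/Atom}feed"; keep the "{...}" part.
match = re.match(r"\{.*\}", root.tag)
ns = match.group(0) if match else ""

# String interpolation (f-strings need Python 3.6+) builds the qualified name.
link = root.find(f"{ns}link")
print(ns, link.get("href"))
```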
I've been using similar code to this and have found it's always worth reading the documentation... as usual!
findall() will only find elements which are direct children of the current tag. So, not really ALL.
It might be worth your while to get your code working with a search that also includes sub-sub-elements (etc.), especially if you're dealing with big and complex XML files.
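As a sketch of what such a search could look like (an assumption about the intended code, using the standard ".//" path prefix and Element.iter()), compare the three calls below on a small hypothetical document with nested elements:

```python
import xml.etree.ElementTree as ET

# Hypothetical document with owl:Class elements at three different depths.
XML = """\
<root xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class/>
  <group>
    <owl:Class/>
    <group><owl:Class/></group>
  </group>
</root>
"""
root = ET.fromstring(XML)
namespaces = {"owl": "http://www.w3.org/2002/07/owl#"}

# Direct children of root only.
direct = root.findall("owl:Class", namespaces)

# ".//" searches all descendants at any depth.
nested = root.findall(".//owl:Class", namespaces)

# Element.iter() also walks the whole subtree; it takes the {uri}tag form.
via_iter = list(root.iter("{http://www.w3.org/2002/07/owl#}Class"))

print(len(direct), len(nested), len(via_iter))
```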
If you know yourself where elements are in your xml, then I suppose it'll be fine! Just thought this was worth remembering.
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"
This is basically Davide Brunato's answer; however, I found that his answer had serious problems with the default namespace being the empty string, at least on my Python 3.6 installation. I distilled a function from his code that worked for me.
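The function might look roughly like the sketch below, which assumes it wraps ET.iterparse as in Davide Brunato's answer. The name get_namespaces, the sample XML and the example.com URIs are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def get_namespaces(xml_text, default_prefix="ns0"):
    """Collect prefix -> URI pairs, renaming the empty default prefix."""
    namespaces = {}
    events = ET.iterparse(io.StringIO(xml_text), events=["start-ns"])
    for _event, (prefix, uri) in events:
        # The default namespace is reported with an empty prefix;
        # store it under the placeholder instead.
        namespaces[prefix or default_prefix] = uri
    return namespaces


# Hypothetical document with a default namespace and one prefixed namespace.
XML = (
    '<root xmlns="http://example.com/default" xmlns:x="http://example.com/x">'
    "<item>a</item><x:item>b</x:item></root>"
)

namespaces = get_namespaces(XML)
root = ET.fromstring(XML)

# Elements in the default namespace are now reachable via the placeholder.
default_items = root.findall("ns0:item", namespaces)
print([item.text for item in default_items])
```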
In it, ns0 is just a placeholder for the empty namespace, and you can replace it with any string you like. Searching with the resulting dictionary also produces the correct answer for tags using the default namespace.
My solution is based on @Martijn Pieters' comment. The trick is to use different dictionaries for serialization and for searching.
First, register all namespaces for parsing and writing.
For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).
Now the functions from the find() family can be used with the default prefix, but the written output does not use any prefixes for elements in the default namespace.
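The steps above can be sketched as follows. The sample XML and example.com URIs are hypothetical, and collecting the prefixes with iterparse is an assumption about how the dictionary is obtained. Note that ET.register_namespace changes process-wide state.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace and one prefixed namespace.
XML = (
    '<root xmlns="http://example.com/default" xmlns:x="http://example.com/x">'
    "<item>a</item><x:item>b</x:item></root>"
)

# Collect prefix -> URI pairs; the default namespace gets the key "".
namespaces = dict(
    node for _event, node in ET.iterparse(io.StringIO(XML), events=["start-ns"])
)

# Register every namespace (including the empty prefix) for serialization.
for prefix, uri in namespaces.items():
    ET.register_namespace(prefix, uri)

root = ET.fromstring(XML)

# Only after registering: give the default namespace a non-empty prefix
# so the dictionary can be used for searching.
namespaces["default"] = namespaces.pop("")

items = root.findall("default:item", namespaces)

# Serialization still writes default-namespace elements without a prefix.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```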
A slightly longer alternative is to create another class, ElementNS, which inherits from ET.Element and includes the namespaces, then create a constructor for this class which is passed on to the parser. findall can then be used without passing namespaces.
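A minimal sketch of this idea, assuming the class is handed to the parser through the element_factory argument of ET.TreeBuilder and that the namespace map is fixed on the class. Apart from the ElementNS name, the details are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML.
XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>
"""


class ElementNS(ET.Element):
    """Element whose search methods fall back to a fixed namespace map."""

    namespaces = {
        "owl": "http://www.w3.org/2002/07/owl#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    }

    def find(self, path, namespaces=None):
        return super().find(path, namespaces or self.namespaces)

    def findall(self, path, namespaces=None):
        return super().findall(path, namespaces or self.namespaces)


# The tree builder calls ElementNS(tag, attrib) for every element it creates.
parser = ET.XMLParser(target=ET.TreeBuilder(element_factory=ElementNS))
root = ET.fromstring(XML, parser=parser)

labels = [
    owl_class.find("rdfs:label").text
    for owl_class in root.findall("owl:Class")
]
print(labels)
```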