Parsing XML with namespaces in Python via 'ElementTree'

Posted on 2025-01-20 02:17:09

I have the following XML which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.

Comments (8)

地狱即天堂 2025-01-27 02:17:09

You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can of course use the same syntax yourself:

root.findall('{http://www.w3.org/2002/07/owl#}Class')
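
For the full task in the question (find every owl:Class, then read its rdfs:label children), a minimal sketch might look like the following. The XML is a trimmed inline copy of the question's document, and the names NAMESPACES and class_labels are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question, inlined so the sketch runs on its own.
XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

# Prefix -> URI mapping used only for searching (hypothetical name).
NAMESPACES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def class_labels(root):
    """Return the text of each rdfs:label that is a direct child of an owl:Class."""
    labels = []
    for cls in root.findall("owl:Class", NAMESPACES):
        for label in cls.findall("rdfs:label", NAMESPACES):
            labels.append(label.text)
    return labels


root = ET.fromstring(XML)
print(class_labels(root))
```

The prefixes in the dictionary do not have to match the ones in the file; only the URIs matter.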

Also see the Parsing XML with Namespaces section of the ElementTree documentation.

If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in the .nsmap attribute on elements and generally has superior namespace support.

蓝戈者 2025-01-27 02:17:09

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.

Here's another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this:

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k, v in root.nsmap.items():  # .iteritems() exists only on Python 2
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
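
The same remapping can be written as a small helper. This is only a sketch: searchable_nsmap is a hypothetical name, and the nsmap dictionary below is a hand-written stand-in for what lxml reports as root.nsmap, so that the example runs with the standard library alone.

```python
import xml.etree.ElementTree as ET


def searchable_nsmap(nsmap, default_prefix="myprefix"):
    """Copy a prefix -> URI mapping, renaming the default (None or '') prefix."""
    return {(prefix or default_prefix): uri for prefix, uri in nsmap.items()}


# Stand-in for lxml's root.nsmap on the document above (assumption for this sketch).
nsmap = {None: "http://www.mynamespace.com/prefix"}
namespaces = searchable_nsmap(nsmap)

root = ET.fromstring(
    '<Tag1 xmlns="http://www.mynamespace.com/prefix"><Tag2>content</Tag2></Tag1>'
)
print(root.find("myprefix:Tag2", namespaces).text)
```

Pick a default_prefix that is not already used in the document, otherwise the renamed entry would overwrite the existing one.
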
擦肩而过的背影 2025-01-27 02:17:09

Note: This answer is useful for Python's ElementTree standard library when you do not want to hard-code namespaces.

To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, parsing only namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as an argument to the search functions:

root.findall('owl:Class', my_namespaces)
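
If you would rather not read the data twice, the namespaces and the tree can probably be collected in one pass by also listening for start events. The following is a sketch under that assumption; parse_with_namespaces is a name invented for this example and the XML is a trimmed copy of the question's document.

```python
from io import StringIO
import xml.etree.ElementTree as ET

XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""


def parse_with_namespaces(xml_text):
    """Parse xml_text once and return (root element, {prefix: uri})."""
    namespaces = {}
    root = None
    events = ("start", "start-ns")
    for event, payload in ET.iterparse(StringIO(xml_text), events=events):
        if event == "start-ns":
            prefix, uri = payload  # start-ns yields a (prefix, uri) tuple
            namespaces[prefix] = uri
        elif root is None:
            root = payload  # the first "start" event carries the root element
    return root, namespaces


root, namespaces = parse_with_namespaces(XML)
classes = root.findall("owl:Class", namespaces)
print(len(classes), sorted(namespaces))
```

The root element is only fully populated once the loop has finished, which is why the search happens after the function returns.
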
江心雾 2025-01-27 02:17:09

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

import re

root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)

This way, you can use it later on in your code to find nodes, e.g. using f-strings (Python 3.6+).

link = root.find(f"{ns}link")
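
Note that re.match returns None when the root tag has no namespace, so a guarded version may be safer. Here is a small sketch; the Atom-style document and the helper name namespace_of are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: a tiny document with a default namespace.
XML = '<feed xmlns="http://www.w3.org/2005/Atom"><link href="http://example.com/"/></feed>'


def namespace_of(element):
    """Return the '{uri}' part of element.tag, or '' if the tag has no namespace."""
    match = re.match(r"\{.*\}", element.tag)
    return match.group(0) if match else ""


root = ET.fromstring(XML)
ns = namespace_of(root)
link = root.find(f"{ns}link")
print(ns, link.get("href"))
```

This only helps for elements that share the root's namespace; elements in other namespaces still need their own URI.
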
ペ泪落弦音 2025-01-27 02:17:09

I've been using similar code to this and have found it's always worth reading the documentation... as usual!

findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included.
If you already know where the elements are in your XML, then I suppose findall() will be fine! Just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"
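
To make the difference concrete, here is a short sketch comparing the three ways of searching on a trimmed copy of the question's XML. The NS dictionary and variable names are made up for the example. Note that iter() takes no namespaces argument, so it needs the fully qualified {uri}tag form.

```python
import xml.etree.ElementTree as ET

RDFS = "http://www.w3.org/2000/01/rdf-schema#"
NS = {"owl": "http://www.w3.org/2002/07/owl#", "rdfs": RDFS}

XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

root = ET.fromstring(XML)

# Direct children of the root only: the label is one level deeper, so nothing matches.
direct = root.findall("rdfs:label", NS)

# './/' searches all descendants of the root.
nested = root.findall(".//rdfs:label", NS)

# iter() also walks the whole subtree, but wants the '{uri}tag' form.
via_iter = list(root.iter(f"{{{RDFS}}}label"))

print(len(direct), len(nested), len(via_iter))
```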

说好的呢 2025-01-27 02:17:09

This is basically Davide Brunato's answer; however, I found that his answer had a serious problem with the default namespace being the empty string, at least on my Python 3.6 installation. The function I distilled from his code, and that worked for me, is the following:

from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
    namespaces = dict([
            node for _, node in ElementTree.iterparse(
                StringIO(xml_string), events=['start-ns']
            )
    ])
    namespaces["ns0"] = namespaces[""]
    return namespaces

where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.

If I then do:

my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)

It also produces the correct answer for tags that use the default namespace.
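
One caveat: the function above raises KeyError when the document declares no default namespace. A variant that renames the empty prefix only when it is present could look like this sketch. The default_prefix parameter and the example.com documents are assumptions for the illustration.

```python
from io import StringIO
import xml.etree.ElementTree as ET


def get_namespaces(xml_string, default_prefix="ns0"):
    """Collect {prefix: uri}, storing the default namespace under default_prefix."""
    namespaces = {}
    for _, (prefix, uri) in ET.iterparse(StringIO(xml_string), events=["start-ns"]):
        # An empty prefix means "default namespace"; give it a usable name.
        namespaces[prefix or default_prefix] = uri
    return namespaces


# Hypothetical documents: one with a default namespace, one without.
WITH_DEFAULT = (
    '<root xmlns="http://example.com/default" xmlns:x="http://example.com/x">'
    "<item/><x:item/></root>"
)
WITHOUT_DEFAULT = '<x:root xmlns:x="http://example.com/x"><x:item/></x:root>'

ns_a = get_namespaces(WITH_DEFAULT)
ns_b = get_namespaces(WITHOUT_DEFAULT)
root_a = ET.fromstring(WITH_DEFAULT)
print(len(root_a.findall("ns0:item", ns_a)), ns_b)
```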

谁对谁错谁最难过 2025-01-27 02:17:09

My solution is based on @Martijn Pieters' comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.items():
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).

namespaces['default'] = namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print(root.find('default:myelem', namespaces))

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.
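
Put together, the approach might look like the following sketch. The document, the myelem and spec:other elements, and the search_ns name are all made up for the illustration; it also keeps the search dictionary separate instead of modifying the original one.

```python
import xml.etree.ElementTree as ET

DEFAULT_URI = "http://www.example.com/default-schema"
SPEC_URI = "http://www.example.com/specialized-schema"

# Hypothetical document that uses both namespaces.
XML = (
    f'<root xmlns="{DEFAULT_URI}" xmlns:spec="{SPEC_URI}">'
    "<myelem>hello</myelem><spec:other/></root>"
)

# Serialization side: prefixes registered globally on the ElementTree module.
ET.register_namespace("", DEFAULT_URI)
ET.register_namespace("spec", SPEC_URI)

# Search side: a separate mapping with a non-empty alias for the default namespace.
search_ns = {"default": DEFAULT_URI, "spec": SPEC_URI}

root = ET.fromstring(XML)
found = root.find("default:myelem", search_ns)
serialized = ET.tostring(root, encoding="unicode")

print(found.text)
print(serialized)
```

The "default" alias exists only in the search dictionary, so it never shows up in the serialized output.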

牵强ㄟ 2025-01-27 02:17:09

A slightly longer alternative is to create another class, ElementNS, which inherits from ET.Element and carries the namespaces, then create a factory function for this class and pass it to the parser:

import xml.etree.ElementTree as ET


def parse_namespaces(source):
    return dict(node for _e, node in ET.iterparse(source, events=['start-ns']))


def create_element_factory(namespaces):
    def element_factory(tag, attrib):
        el = ElementNS(tag, attrib)
        el.namespaces = namespaces
        return el
    return element_factory


class ElementNS(ET.Element):
    namespaces = None

    # Patch methods to include namespaces
    def find(self, path):
        return super().find(path, self.namespaces)

    def findtext(self, path, default=None):
        return super().findtext(path, default, self.namespaces)

    def findall(self, path):
        return super().findall(path, self.namespaces)

    def iterfind(self, path):
        return super().iterfind(path, self.namespaces)


def parse(source):
    # Set up parser with namespaced element factory
    namespaces = parse_namespaces(source)
    element_factory = create_element_factory(namespaces)
    tree_builder = ET.TreeBuilder(element_factory=element_factory)
    parser = ET.XMLParser(target=tree_builder)
    element_tree = ET.ElementTree()

    return element_tree.parse(source, parser=parser)

Then findall can be used without passing namespaces:

document = parse("filename")
document.findall("owl:Class")