How do I run an XPath query with R's XML library?
The XML file contains this snippet:
<?xml version="1.0"?>
<PC-AssayContainer
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:schemaLocation="http://www.ncbi.nlm.nih.gov ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd"
>
....
<PC-AnnotatedXRef>
<PC-AnnotatedXRef_xref>
<PC-XRefData>
<PC-XRefData_pmid>17959251</PC-XRefData_pmid>
</PC-XRefData>
</PC-AnnotatedXRef_xref>
</PC-AnnotatedXRef>
I tried to query it with XPath's global search (//), and also tried passing a namespace:
library('XML')
doc = xmlInternalTreeParse('http://s3.amazonaws.com/tommy_chheng/pubmed/485270.descr.xml')
> xpathApply(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> getNodeSet(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns="xs")
list()
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns= c(xs = "http://www.w3.org/2001/XMLSchema-instance"))
list()
Shouldn't the XPath expression match this element?
<PC-XRefData_pmid>17959251</PC-XRefData_pmid>
Answers (2)
This is a FAQ.
This:
//PC-XRefData_pmid
means: any PC-XRefData_pmid element in the document that is in no namespace (the empty namespace). It does not mean any PC-XRefData_pmid element in the document's default namespace.
Also, your document sample is incomplete, but it looks like your PC-XRefData_pmid element is in the http://www.ncbi.nlm.nih.gov namespace.
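To illustrate the point that an unprefixed name test only matches elements in no namespace, here is a minimal sketch using the XML package. The two tiny inline documents and the variable names are assumptions made for the illustration; they are not the real PubChem file.

```r
library(XML)

# Hypothetical stand-in documents: one declares the NIH URI as its default
# namespace, the other declares no namespace at all.
with_ns <- xmlParse(
  '<root xmlns="http://www.ncbi.nlm.nih.gov"><PC-XRefData_pmid>17959251</PC-XRefData_pmid></root>',
  asText = TRUE
)
without_ns <- xmlParse(
  '<root><PC-XRefData_pmid>17959251</PC-XRefData_pmid></root>',
  asText = TRUE
)

# The same unprefixed XPath is run against both documents.
n_with <- length(getNodeSet(with_ns, "//PC-XRefData_pmid"))
n_without <- length(getNodeSet(without_ns, "//PC-XRefData_pmid"))

# Only the element that is in no namespace is found.
print(c(with_default_namespace = n_with, no_namespace = n_without))
```

The empty result for the first document is the same behaviour shown in the question's transcript.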
Since the default namespace is the NIH one (whose URI is "http://www.ncbi.nlm.nih.gov"), <PC-XRefData_pmid> (and every other element in your XML document that has no namespace prefix) is in that NIH namespace. So to match them with an XPath, you need to tell your XPath processor what prefix you're going to use for the NIH namespace, and you need to use that prefix in your XPath.
So, without knowing R, I would try an XPath that uses a prefix bound to the NIH namespace, or else one that tests only the element's local name with local-name(), as the latter bypasses namespaces.
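The two approaches described here might look roughly like this in R. This is a sketch under stated assumptions: the prefix nih is an arbitrary choice, the inline document is a cut-down stand-in for the real file, and xpathSApply with the namespaces argument is used in place of whatever call the answer originally showed.

```r
library(XML)

# Cut-down stand-in for the PubChem file, so no network access is needed.
xml_text <- '<?xml version="1.0"?>
<PC-AssayContainer xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-AnnotatedXRef>
    <PC-AnnotatedXRef_xref>
      <PC-XRefData>
        <PC-XRefData_pmid>17959251</PC-XRefData_pmid>
      </PC-XRefData>
    </PC-AnnotatedXRef_xref>
  </PC-AnnotatedXRef>
</PC-AssayContainer>'
doc <- xmlParse(xml_text, asText = TRUE)

# Approach 1: bind a prefix of our choosing ("nih" is an assumption) to the
# NIH namespace URI and use that prefix in the XPath.
nih <- c(nih = "http://www.ncbi.nlm.nih.gov")
by_prefix <- xpathSApply(doc, "//nih:PC-XRefData_pmid", xmlValue,
                         namespaces = nih)

# Approach 2: compare only the local name, which ignores namespaces entirely.
by_local <- xpathSApply(doc, "//*[local-name() = 'PC-XRefData_pmid']",
                        xmlValue)

print(by_prefix)
print(by_local)
```

The local-name() form is convenient but would also match a same-named element from any other namespace, so the prefixed form is the stricter of the two.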
Just because the XML document declares the NIH namespace as the default one doesn't mean that the XPath processor will know that. In the XML information model, namespace prefixes are not significant. So when I parse in an XML document, it's not supposed to matter whether the NIH namespace is bound to the "nih:" prefix or the "snizzlefritz:" prefix or the "" (default) prefix. The XML parser or XPath processor is not supposed to have to know what prefix got bound to what namespace in the XML document. Especially since there could be several different prefixes bound to the same namespace at different places in the same document... and vice versa. So if you want to have your XPath expression match an element that's in a namespace, you have to declare that namespace to the XPath processor.
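A small sketch of why the prefix itself is not significant: the same XPath, using a prefix that appears in neither document, matches the element whether the document binds the NIH URI as its default namespace or to some other prefix. Both inline documents and the prefix p are assumptions made for the illustration.

```r
library(XML)

# The same namespace URI, bound once as the default namespace and once to
# the (hypothetical) prefix "p".
default_doc <- xmlParse(
  '<root xmlns="http://www.ncbi.nlm.nih.gov"><PC-XRefData_pmid>17959251</PC-XRefData_pmid></root>',
  asText = TRUE
)
prefixed_doc <- xmlParse(
  '<p:root xmlns:p="http://www.ncbi.nlm.nih.gov"><p:PC-XRefData_pmid>17959251</p:PC-XRefData_pmid></p:root>',
  asText = TRUE
)

# The XPath side declares its own prefix; only the URI has to agree.
ns <- c(snizzlefritz = "http://www.ncbi.nlm.nih.gov")
path <- "//snizzlefritz:PC-XRefData_pmid"

from_default <- xpathSApply(default_doc, path, xmlValue, namespaces = ns)
from_prefixed <- xpathSApply(prefixed_doc, path, xmlValue, namespaces = ns)

print(from_default)
print(from_prefixed)
```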
Edit: There are a few caveats, contributed by @Jim Pivarski:
So if "doc" is an instance of a document class, the correct solution is:
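A minimal sketch of what such a call might look like with the XML package, offered as an illustration and not as the exact code from the answer. It assumes doc is a parsed document (here built from a small inline stand-in), picks nih as an arbitrary prefix, and passes the binding through the argument named namespaces, which is the name getNodeSet and xpathApply use; the ns= spelling in the question is not one of their named arguments.

```r
library(XML)

# Inline stand-in for the real file; the question parsed a URL instead.
doc <- xmlParse(
  '<PC-AssayContainer xmlns="http://www.ncbi.nlm.nih.gov">
     <PC-XRefData><PC-XRefData_pmid>17959251</PC-XRefData_pmid></PC-XRefData>
   </PC-AssayContainer>',
  asText = TRUE
)

# Check that we are holding a document object, as the text above assumes.
stopifnot(inherits(doc, "XMLInternalDocument"))

# Named character vector: name = prefix used in the XPath, value = URI.
nodes <- getNodeSet(
  doc,
  "//nih:PC-XRefData_pmid",
  namespaces = c(nih = "http://www.ncbi.nlm.nih.gov")
)

# Pull the text content out of each matched node.
pmids <- sapply(nodes, xmlValue)
print(pmids)
```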