Selecting the default namespace in XPath with HtmlUnit

Posted 2024-11-09 13:38:08

I want to parse a Feedburner feed with HtmlUnit.
The feed is this one: http://feeds.feedburner.com/alcoanewsreleases

From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.

Groovy code snippet:

def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")

Sample of the XML feed:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">

[...SNIP...]

<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&amp;pageID=20110518006002en">
    <title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
    <dc:date>2011-05-18</dc:date>
    <link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
    <description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
    <feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&amp;pageID=20100104006194en</feedburner:origLink>
</item>

[...SNIP...]

</rdf:RDF>

I suspect this to be an issue with the namespaces because this document has 4 namespaces. The namespaces are

(this is the default) xmlns="http://purl.org/rss/1.0/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"

I have tried Nokogiri on this feed (another XML parser that I use for Ruby scripts).
With Nokogiri I could just use the XPath //xmlns:item, which works and returns all item nodes from the feed.

I have tried the same XPath with HtmlUnit but it does not work.

So I think I can phrase my question as:
How can I select a node from the default namespace with HtmlUnit?

Any ideas?

后来的我们 2024-11-16 13:38:08

> From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.

In XPath, that means "select all elements whose local name is item that are in no namespace". In RSS, the item elements must be in a namespace. So the above should never work with a conforming XML parser and XPath engine.

What's confusing is that in XML, <item> means "an element named item that is in the default namespace, i.e. whatever default namespace is in scope at this place in the document;" whereas in XPath, "item" means an element in no namespace. (Or, you could say, it means an element in the default namespace, but unless you have a way to tell XPath what the default namespace is, the default namespace is no namespace. Usually (always?) in XPath 1.0 there is no way to declare the default namespace for XPath expressions.)
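The difference can be seen with any conforming XPath 1.0 engine. The following is a minimal sketch that uses the JDK's built-in JAXP parser and XPath implementation rather than HtmlUnit; the class name NoNamespaceDemo and the cut-down FEED string are made up for illustration and are not part of the original question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NoNamespaceDemo {
    // Hypothetical stand-in for the feed: two items in the RSS 1.0 default namespace.
    static final String FEED =
        "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
        + " xmlns='http://purl.org/rss/1.0/'>"
        + "<item><title>first</title></item>"
        + "<item><title>second</title></item>"
        + "</rdf:RDF>";

    // Parses xml with namespace awareness and counts the nodes that expr selects.
    static int count(String xml, String expr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers are not namespace-aware by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An unprefixed name test only matches elements in no namespace.
        System.out.println("//item -> " + count(FEED, "//item"));
        // Comparing the local name ignores the namespace entirely.
        System.out.println("local-name() -> " + count(FEED, "//*[local-name() = 'item']"));
    }
}
```

Here //item selects nothing because both item elements belong to http://purl.org/rss/1.0/, while the local-name() test finds both of them.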

The other confusing thing to beginners is that the namespace prefix mappings in the source XML document are not considered significant by the XPath processor. When the XML document is parsed, a data structure is built that remembers the name and namespace of every element (and other nodes). The namespace prefixes used, including the empty prefix of the default namespace, are considered mere syntactic convenience. More on this below...

> With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed.

Whatever that is, it's not XPath. Maybe it's a Nokogiri extension to it (a very convenient one, but its syntax is really counter-intuitive).

> So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?

Let's phrase it as: How can I select the RSS item elements with HtmlUnit? I phrase it that way because the RSS spec (actually in general any conforming XML vocabulary spec) does not require that its elements will be in the default namespace. That happens to be true in the sample you received, but the service provider could change that tomorrow and still be perfectly conformant to RSS. Tomorrow, the service provider could use the "rss" namespace prefix for that namespace; or any other arbitrary prefix. What RSS does specify is what namespace its elements will be in: the namespace whose URI is http://purl.org/rss/1.0/.

It's kind of like asking, "How do I write a function (in JavaScript, C, Java, etc.) that can tell me the value of the variable a?" Usually a function has no idea what variable name was used for what in the caller. All it knows are the values of its arguments. If you call sqrt(4), you'll get the same answer as with a = 4; sqrt(a) or rumpelstiltzkin = 4; sqrt(rumpelstiltzkin). Clearly, the name of the variable passed as the argument has no direct effect on the result of the function call. It just needs to be the name of a variable that holds the right value. If a compiler complained because you wrote b = 4; return sqrt(b) instead of using a, you'd think that compiler was nuts. It's not supposed to care about variable names as long as you use valid identifiers.

In the same way, when processing RSS, we're not supposed to care about what namespace prefix is used, as long as it's a prefix that identifies the right namespace. It could be no prefix (which identifies the default namespace).
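To make the analogy concrete, here is a small sketch (plain JDK DOM, not HtmlUnit) showing that a parser reports the same namespace URI and local name whether the document uses the default namespace or a prefix. The class name ExpandedNameDemo and both sample documents are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ExpandedNameDemo {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Hypothetical document that declares RSS as the default namespace.
    static final String DEFAULT_NS_DOC =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns='" + RSS_NS + "'><item/></rdf:RDF>";

    // The same content, but with RSS bound to the prefix "rss".
    static final String PREFIXED_DOC =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:rss='" + RSS_NS + "'><rss:item/></rdf:RDF>";

    // Returns "{namespace-uri}local-name" for the first child of the root element.
    static String firstChildExpandedName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node child = doc.getDocumentElement().getFirstChild();
            return "{" + child.getNamespaceURI() + "}" + child.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both lines print the same expanded name; only the prefix in the source differs.
        System.out.println(firstChildExpandedName(DEFAULT_NS_DOC));
        System.out.println(firstChildExpandedName(PREFIXED_DOC));
    }
}
```

The prefix plays the role of the variable name in the analogy above: the parsed element carries the namespace URI and the local name, and those are what a query should be written against.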

In XPath 2.0, you can wildcard the namespace. This is very handy if you know you're not going to need namespaces for disambiguation. In that case you can select //*:item. However, I don't think HtmlUnit supports XPath 2.0. Also, in XPath 2.0 environments like XSLT 2.0, you can specify a default namespace for XPath expressions, but that won't help you in HtmlUnit.

So you have a couple of choices:

1. Use an XPath expression that ignores namespaces, such as //*[local-name() = 'item'].

2. The robust way: Register a namespace prefix for http://purl.org/rss/1.0/ and use it in your XPath expression: //rss:item. The question then becomes, how do you register a namespace prefix in HtmlUnit and pass it to the XPath processor? I took a quick look in the docs and didn't find any facility for doing that.
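For comparison, this is roughly what registering a prefix looks like with the JDK's standard javax.xml.xpath API. It is only a sketch of the general technique and does not show an HtmlUnit facility; the class name RssPrefixDemo, the prefix "rss", and the two sample feeds are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssPrefixDemo {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Hypothetical feeds: one uses the default namespace, the other an arbitrary prefix.
    static final String DEFAULT_NS_FEED =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns='" + RSS_NS + "'><item/><item/></rdf:RDF>";
    static final String PREFIXED_FEED =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:x='" + RSS_NS + "'><x:item/><x:item/></rdf:RDF>";

    // Binds the prefix "rss" (chosen by the query author, not by the feed) to the RSS namespace.
    static final NamespaceContext RSS_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "rss".equals(prefix) ? RSS_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return RSS_NS.equals(uri) ? "rss" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return RSS_NS.equals(uri)
                ? Collections.singletonList("rss").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Counts the nodes selected by expr, resolving prefixes through RSS_CONTEXT.
    static int count(String xml, String expr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(RSS_CONTEXT);
            return ((NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET)).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(DEFAULT_NS_FEED, "//rss:item"));
        System.out.println(count(PREFIXED_FEED, "//rss:item"));
    }
}
```

The same //rss:item expression matches the items in both feeds, because the prefix in the expression is resolved to the namespace URI before matching and never compared with the prefixes used in the document.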

Caveat: I should add that the above is in regard to conforming XPath processors. I have no idea what XPath processor HtmlUnit uses. There are some XPath processors out there that ignore the specs and make the world more confusing for everybody.

I saw here that someone used the following syntax for elements in the default namespace in HtmlUnit:

//:item

But I wouldn't recommend that, for three reasons:

1. It's not valid XPath, so you can't expect it to work with other programs.
2. It will only work on RSS feeds that declare the RSS namespace to be the default namespace. RSS feeds that use a namespace prefix will cause the above to fail.
3. It will hold you back from learning how XML namespaces really work, and it will help preserve the status quo of tools that don't adequately support namespaces.

HtmlUnit is primarily designed for HTML, so incomplete handling of XML is understandable. But claiming to support XPath and then not providing a way to declare namespace prefixes is a bug. HtmlUnit uses an XPath package that seems to be part of Xalan-J. That package has ways to provide namespace mappings to XPath, but I don't know if HtmlUnit exposes that functionality.

暗恋未遂 2024-11-16 13:38:08

This sounds familiar enough that I'm quite sure I've used namespaces and XPath successfully with HtmlUnit in the past, but of course I can't find the code. I suspect it must have been with HTML pages only: the page reference in your example is an XmlPage which has a number of methods specific to namespaces, all of which throw a "not implemented yet" exception when used. :-(

The current version (2.8) of HtmlUnit is nearly a year old, so it may be that some work has been done in the meantime to support XML namespaces. The "HtmlUnit Users" mailing list would be the place to find out.

In the meantime, as always there is a workaround:

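import java.util.ArrayList;
import java.util.List;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.xml.XmlPage;

// webClient is an existing com.gargoylesoftware.htmlunit.WebClient
final XmlPage page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases");

// no good: finds none of the feed's items
List<?> elements = page.getByXPath("//item");
System.out.println(elements.size());

// ugly, but it works: fetch the root element, then filter its children by local name
DomElement de = (DomElement) page.getFirstByXPath("//rdf:RDF");
List<DomNode> items = new ArrayList<DomNode>();
for (DomNode dn : de.getChildNodes()) {
    String name = dn.getLocalName();
    // skip nodes that report no local name
    if (name != null && name.equals("item")) {
        items.add(dn);
    }
}
System.out.println("found " + items.size());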

Oh boy Java is painful after working in Scala... ;-)
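The loop above compares local names only, so it would also pick up an element called item from some other namespace. A stricter variant checks the namespace URI as well. The sketch below shows that check with the JDK's own DOM classes instead of HtmlUnit's; the class name NamespaceCheckedFilter, the sample FEED, and the urn:example:other namespace are all invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceCheckedFilter {
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Hypothetical sample: two RSS items plus a decoy "item" from an unrelated namespace.
    static final String FEED =
        "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
        + " xmlns='" + RSS_NS + "' xmlns:other='urn:example:other'>"
        + "<item/><other:item/><item/>"
        + "</rdf:RDF>";

    // Returns the root's child elements named "item" that are in the RSS namespace.
    static List<Element> rssItems(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            List<Element> items = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Match on namespace URI and local name, never on the prefix.
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && RSS_NS.equals(child.getNamespaceURI())
                        && "item".equals(child.getLocalName())) {
                    items.add((Element) child);
                }
            }
            return items;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("found " + rssItems(FEED).size());
    }
}
```

Whether HtmlUnit's DomNode reports namespace URIs reliably for an XmlPage is something to verify against the version in use; if it does, the same two-part comparison could replace the local-name-only test in the workaround.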
