XPath can't find a table by id

Posted 2024-07-19 08:36:36 · 929 characters · 5 views · 0 comments


I'm doing some screen scraping using WATIJ, but it can't read HTML tables (throws NullPointerExceptions or UnknownObjectExceptions). To overcome this I read the HTML and run it through JTidy to get well-formed XML.

I want to parse it with XPath, but it can't find a <table ...> by id even though the table is there in the XML plain as day. Here is my code:

XPathFactory factory=XPathFactory.newInstance();  
XPath xPath=factory.newXPath();  
InputSource inputSource = new InputSource(new StringReader(tidyHtml));  
XPathExpression xPathExpression=xPath.compile("//table[@id='searchResult']");  
String expression = "//table[@id='searchResult']";
String table = xPath.evaluate(expression, inputSource);
System.out.println("table = " + table);

The table is an empty String.

The table is in the XML, however. If I print the tidyHtml String it shows

 <table
   class="ApptableDisplayTag"
   id="searchResult"
   style="WIDTH: 99%">

I haven't used XPath before so maybe I'm missing something.

Can anyone set me straight? Thanks.


时光与爱终年不遇 2024-07-26 08:36:36


I don't know anything about JTidy, but for WATIJ, I believe you are getting the NullPointer and UnknownObject exceptions because your XPath uses lower-case node names. Say you are using "//table[@id='searchResult']" as the XPath to look up the table in WATIJ. That won't actually work, because "table" is in lower case. For WATIJ, all the node names need to be in upper case, e.g. "//TABLE[@id='searchResult']". As an example, to print the number of rows of that table using WATIJ, you'd do the following:

import watij.runtime.ie.IE;
import static watij.finders.SymbolFactory.*;

public class Example {
    public static void main(String[] args) {
        IE ie = new IE();
        ie.start("your_url_goes_here");
        System.out.println(ie.table(xpath, "//TABLE[@id='searchResult']").rowCount());
        ie.close();
    }
}

This code or answer may not be right, since I only started using WATIJ today. I did run into this exact problem with XPaths, though. It took me a couple of hours of searching and testing before I noticed how all the XPaths were cased in the WATIJ User Guide. Once I changed the casing in my XPaths, WATIJ was able to locate the objects, so this should work for you as well.
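The casing point can be checked outside WATIJ, because XPath name tests are case-sensitive in general. The sketch below uses only the standard `javax.xml.xpath` API, not WATIJ, so it does not confirm how WATIJ itself reports tag names. The class name `CaseSensitivity`, the helper `count`, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CaseSensitivity {
    // Counts the nodes that an XPath expression selects in the given markup.
    static int count(String xml, String expression) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(expression, document, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: element names are lower case, as in the question's tidyHtml.
        String xml = "<html><body><table id=\"searchResult\"><tr><td>x</td></tr></table></body></html>";
        System.out.println(count(xml, "//table[@id='searchResult']"));
        System.out.println(count(xml, "//TABLE[@id='searchResult']"));
    }
}
```

Against this lower-case document the first expression selects the table and the second selects nothing, so the expression has to use whatever casing the underlying DOM reports.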


夏末 2024-07-26 08:36:36

Your XPath is correct... whatever is failing, it isn't that.


咿呀咿呀哟 2024-07-26 08:36:36


I have never used Java's XPath API directly; I have always used XPath through dom4j or in other languages (Perl and C), but I have a good understanding of how it normally works. First, you should probably parse the input into a DOM document, which will help a lot. Also, if you know that your document has IDs, you should parse it with the DTD or Schema that describes it loaded, so that the XML parser marks and identifies the nodes that have proper IDs. Once you have done this, you can use your code with the DOM tree.

The documentation of [XPath.evaluate(expression, item)](http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/XPath.html#evaluate(java.lang.String,%20java.lang.Object)) shows that the second argument should be a Node or a NodeList. This is probably why you're getting so many UnknownObjectExceptions.
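A minimal sketch of the "parse to a DOM first" suggestion, using the standard JAXP classes. It asks for `XPathConstants.NODE` so that the element itself comes back; the `String`-returning overload used in the question yields the node's text content, which is empty for a table that contains no text. The class name `TableLookup`, the helper `findTable`, and the sample markup are assumptions for illustration, and the sample deliberately has no namespace declaration, so namespace handling is not covered here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TableLookup {
    // Parses well-formed markup into a DOM tree and returns the first element
    // selected by the expression, or null when nothing matches.
    static Element findTable(String tidyHtml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(tidyHtml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        // Request the node itself instead of its string value.
        return (Element) xPath.evaluate(expression, document, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the JTidy output described in the question.
        String tidyHtml = "<html><body>"
                + "<table class=\"ApptableDisplayTag\" id=\"searchResult\" style=\"WIDTH: 99%\">"
                + "<tr><td>row 1</td></tr></table></body></html>";
        Element table = findTable(tidyHtml, "//table[@id='searchResult']");
        System.out.println(table == null
                ? "no match"
                : "found <" + table.getTagName() + "> with class " + table.getAttribute("class"));
    }
}
```

With the `Element` in hand, rows and cells can be read through the usual DOM methods or by evaluating further XPath expressions relative to that node.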

If your XML parser is able to recognize the ID elements then you can access an element having an ID with the following XPath expression:

XPathExpression xPathExpression=xPath.compile("id('searchResult')");
xPathExpression.evaluate(document); // document is a DOM document instance

Using the XPath function id() is the most efficient way to access elements, provided that the elements carry an ID and the attribute has been declared as such in the DTD or Schema.
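As a rough illustration of that requirement, the sketch below declares the attribute as type ID in an internal DTD subset and then looks the element up with `id('searchResult')`. The DOCTYPE, the sample markup, and the names `IdLookup` and `byId` are assumptions made for this example; real JTidy output would need an equivalent declaration, and the behavior shown is that of the JDK's built-in parser and XPath engine.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class IdLookup {
    // Hypothetical document: the internal DTD subset declares table/@id as type ID.
    static final String WITH_DTD =
            "<!DOCTYPE html [\n"
          + "  <!ATTLIST table id ID #IMPLIED>\n"
          + "]>\n"
          + "<html><body><table id=\"searchResult\"><tr><td>row 1</td></tr></table></body></html>";

    // Returns the element whose ID-typed attribute equals the given value, or null.
    static Element byId(String xml, String id) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // id() only finds attributes that the parser knows to be of type ID.
        return (Element) xPath.evaluate("id('" + id + "')", document, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Element table = byId(WITH_DTD, "searchResult");
        System.out.println(table == null ? "no match" : "found <" + table.getTagName() + ">");
    }
}
```

Without such a declaration the parser has no way of knowing that `id` is an ID attribute, in which case an attribute test such as `//table[@id='searchResult']` is the fallback.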
