XPath can't find a table by id

Posted 2024-07-19 08:36:36

I'm doing some screen scraping using WATIJ, but it can't read HTML tables (throws NullPointerExceptions or UnknownObjectExceptions). To overcome this I read the HTML and run it through JTidy to get well-formed XML.

I want to parse it with XPath, but it can't find a <table ...> by id even though the table is there in the XML plain as day. Here is my code:

XPathFactory factory=XPathFactory.newInstance();  
XPath xPath=factory.newXPath();  
InputSource inputSource = new InputSource(new StringReader(tidyHtml));  
XPathExpression xPathExpression=xPath.compile("//table[@id='searchResult']");  
String expression = "//table[@id='searchResult']";
String table = xPath.evaluate(expression, inputSource);
System.out.println("table = " + table);

The table is an empty String.

The table is in the XML, however. If I print the tidyHtml String it shows

 <table
   class="ApptableDisplayTag"
   id="searchResult"
   style="WIDTH: 99%">

I haven't used XPath before so maybe I'm missing something.

Can anyone set me straight? Thanks.

Comments (7)

时光与爱终年不遇 2024-07-26 08:36:36

I don't know anything about JTidy, but for WATIJ, I believe the reason you are getting the NullPointer and UnknownObject exceptions is that your XPath uses lower-case node names. Say you are using "//table[@id='searchResult']" as the XPath to look up the table in WATIJ. That won't actually work because "table" is in lower case. For WATIJ, you need to have all the node names in upper case, e.g. "//TABLE[@id='searchResult']". As an example, say you want to print the number of rows of that table using WATIJ; you'd do the following:

import watij.runtime.ie.IE;
import static watij.finders.SymbolFactory.*;

public class Example {
    public static void main(String[] args) {
        IE ie = new IE();
        ie.start("your_url_goes_here");
        System.out.println(ie.table(xpath, "//TABLE[@id='searchResult']").rowCount());
        ie.close();
    }
}

This code or answer may not be right, since I only started using WATIJ today, though I did run into this exact problem with XPaths. It took me a couple of hours of searching and testing before I noticed how all the XPaths were cased in the WATIJ User Guide. Once I changed the casing in my XPaths, WATIJ was able to locate the objects, so this should work for you as well.

夏末 2024-07-26 08:36:36

Your XPath is correct... whatever it is that's failing, it isn't that.

咿呀咿呀哟 2024-07-26 08:36:36

I have never used Java's XPath API directly; I've always used it through dom4j or in other languages (Perl and C). But I have a good understanding of how it normally works. First, you should probably parse the input into a DOM document; this will help greatly. Also, if you know that your document has IDs, you should parse it with the DTD or Schema that describes it loaded, so that the XML parser marks and identifies the nodes that have proper IDs. Once you have done this, you can use your code with the DOM tree.

The documentation of [XPath.evaluate(expression, item)](http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/XPath.html#evaluate(java.lang.String,%20java.lang.Object)) shows that the second argument should be a Node or a NodeList. This is probably why you're getting so many UnknownObjectExceptions.
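As a rough illustration of the "parse to a DOM first" suggestion, here is a minimal sketch that builds a DOM document and asks XPath for a node rather than a string. The class name `DomXPathSketch`, the helper `findElement`, and the sample markup are made up for this example; the sample deliberately has no `xmlns` attribute and the parser is left at its default (not namespace-aware), so this does not by itself reproduce the JTidy output from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of WATIJ or JTidy.
public class DomXPathSketch {

    /**
     * Parses the XML text into a DOM tree, then returns the first element
     * matching the expression, or null when nothing matches.
     * The expression is assumed to select elements (the cast fails otherwise).
     */
    static Element findElement(String xml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        Document document = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        // XPathConstants.NODE asks for the matched node itself
        // instead of its string value.
        return (Element) xPath.evaluate(expression, document, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input standing in for the tidied HTML.
        String xml = "<html><body>"
                + "<table class=\"ApptableDisplayTag\" id=\"searchResult\">"
                + "<tr><td>cell</td></tr></table></body></html>";

        Element table = findElement(xml, "//table[@id='searchResult']");
        System.out.println(table == null
                ? "not found"
                : "found element with class " + table.getAttribute("class"));
    }
}
```

Getting a `Node` back makes it easier to tell "no match" (null) apart from "matched an element whose text content is empty", which the string-returning overload cannot distinguish.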

If your XML parser is able to recognize the ID elements then you can access an element having an ID with the following XPath expression:

XPathExpression xPathExpression=xPath.compile("id('searchResult')");
xPathExpression.evaluate(document); // document is a DOM document instance

Using the XPath function id() is the most efficient way to access elements, provided the elements use an ID attribute that has been declared as such in the DTD or Schema.
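To make the dependency on the DTD concrete, here is a small sketch that runs `id('searchResult')` against two documents: one with an internal DTD subset declaring `id` as an ID attribute on `table`, and one without any declaration. The class name `IdFunctionSketch`, the helper `findById`, and the inline DTD are assumptions made for illustration; real XHTML would normally reference an external DTD instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class used only for this illustration.
public class IdFunctionSketch {

    /**
     * Evaluates id('<id>') on the parsed document and returns the matching
     * element, or null. The id value is assumed not to contain quote characters.
     */
    static Element findById(String xml, String id) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        return (Element) xPath.evaluate("id('" + id + "')", document, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        String body = "<html><body><table id=\"searchResult\">"
                + "<tr><td>cell</td></tr></table></body></html>";

        // Internal DTD subset that declares table/@id as type ID (assumed for the demo).
        String withDtd = "<!DOCTYPE html [<!ATTLIST table id ID #IMPLIED>]>" + body;

        System.out.println("with ID declaration:    "
                + (findById(withDtd, "searchResult") != null));
        System.out.println("without ID declaration: "
                + (findById(body, "searchResult") != null));
    }
}
```

Without the declaration the parser has no way to know that an attribute named `id` is an ID, so `id()` finds nothing even though `//table[@id='searchResult']` would still match the same element.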

濫情▎り 2024-07-26 08:36:36

It looks like the problem is mostly with JTidy. I can get xpath to parse the JTidy-ied result by doing the following:

Remove all "<&>nbsp;". JTidy returns xhtml with "<&>nbsp;" outside of tags.
Remove the
In the tag remove the xmlns=... attribute
Remove the "head" tags.
(I used some funny formatting because HTML entities won't display when typed properly)

JTidy also puts newlines in the middle of the text content of ... elements.

I'll have to look at other HTML -> XML conversion options. I gave Cobra a quick try, but it also failed to find my table by Id. I haven't tried manually cleaning up the result from Cobra, so I don't know how it compares to JTidy.

If you know of an HTML parser that returns good XML please let me know.

零時差 2024-07-26 08:36:36

The solution was to drop WATIJ and switch to Google WebDriver. WebDriver documents how different browsers handle case in xpath statements.

失去的东西太少 2024-07-26 08:36:36

Double quotes are definitely not required, and neither is uppercase. Namespaces and/or DTD are more likely the answer.
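A minimal sketch of the namespace explanation, assuming the tidied output carries the XHTML namespace on its root element (an earlier reply mentions having to strip an xmlns=... attribute). With a namespace-aware parser, an unprefixed `//table` only matches elements in no namespace, so the lookup has to either bind a prefix or compare local names. The class name `NamespaceXPathSketch`, the prefix `x`, and the sample markup are invented for this example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class used only for this illustration.
public class NamespaceXPathSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Maps the arbitrary prefix "x" to the XHTML namespace. */
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI) ? "x" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI)
                    ? Collections.singletonList("x").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    /** Parses with namespace awareness and returns the first matching node, or null. */
    static Node find(String xml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document document = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(XHTML_CONTEXT);
        return (Node) xPath.evaluate(expression, document, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input with the XHTML default namespace on the root.
        String xhtml = "<html xmlns=\"" + XHTML_NS + "\"><body>"
                + "<table class=\"ApptableDisplayTag\" id=\"searchResult\">"
                + "<tr><td>cell</td></tr></table></body></html>";

        String[] expressions = {
            "//table[@id='searchResult']",                   // no namespace: no match
            "//x:table[@id='searchResult']",                 // prefix bound to XHTML
            "//*[local-name()='table'][@id='searchResult']"  // ignores the namespace
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> "
                    + (find(xhtml, expression) != null ? "found" : "not found"));
        }
    }
}
```

The `local-name()` form needs no `NamespaceContext` at all, which makes it a convenient quick check; binding a prefix is the stricter option because it only matches elements that really are in the XHTML namespace.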

分开我的手 2024-07-26 08:36:36

Unique ID attributes need to be accessed with the id() function: id('search')
