HtmlUnit getByXPath returns null

Posted on 2024-10-05 11:21:57

I am coding in Groovy; however, I don't believe this is a language-specific set of questions.

I actually have two questions.

First Question

I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.

The page I'm testing it on is:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

My code:

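import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

// url holds the page address given above
page = client.getPage(url)

// expected the title link here, but nothing is found
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")

println title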

This simply prints out: []

Is this because the page uses onclick()? If so, how would I get around that? Enabling JavaScript creates a mess in my cmd prompt.

Second Question

I also want to get the image, but I am having trouble because when I copy its XPath (via Firebug) it shows up as: //*[@id="gmi-ResViewSizer_img"]

How do I handle that?

Comments (2)

好菇凉咱不稀罕他 2024-10-12 11:21:57

First Answer:

/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a

Your XPath was off by one in the positional predicate on the body's div: you used the 4th div, but it should be the 3rd. It appears the site's HTML can change, and did change, after you originally grabbed the XPath with Firebug. You may need to adjust your XPath so that it tolerates such changes and is less sensitive to differences in document structure.

Maybe something like this:

/html/body//div/h1/a
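
To illustrate the difference, here is a minimal Groovy sketch that uses only the JDK's built-in XPath support rather than HtmlUnit. The markup is a hypothetical, heavily simplified stand-in for the real page (the actual nesting is much deeper), and the helper name select is made up for this example.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Hypothetical, simplified stand-in for the real page markup.
def markup = '''<html><body>
  <div>header</div>
  <div>navigation</div>
  <div><div><h1><a href="/art/1">Brush Pack</a></h1></div></div>
</body></html>'''

def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(markup.getBytes('UTF-8')))
def xpath = XPathFactory.newInstance().newXPath()

// Helper (name is an assumption): evaluate an expression and
// return the text content of every matching node.
def select = { String expr ->
    def nodes = xpath.evaluate(expr, doc, XPathConstants.NODESET)
    (0..<nodes.length).collect { nodes.item(it).textContent }
}

def brittle = select('/html/body/div[4]/div/h1/a')  // index is off by one
def exact   = select('/html/body/div[3]/div/h1/a')  // correct index
def relaxed = select('/html/body//div/h1/a')        // no positional index

println brittle
println exact
println relaxed
```

In this sketch the off-by-one path matches nothing, which is why an empty list is printed rather than an error, while the relaxed path keeps matching even if another div is added or removed ahead of the content.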

Second Answer: The XPath that you listed will work. It may look odd or short (and may not be the most efficient), but // starts at the root node and searches every node in the tree, * matches any element (including the img), and the [] predicate filter restricts the result to elements whose id attribute equals "gmi-ResViewSizer_img".

There are many other XPath expressions that could work as well. Which one is best will also depend on how often the HTML structure changes. Here is another one that also selects that img on the page referenced:
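
As a rough sketch of how that expression can be used from Groovy, the example below again relies on the JDK's XPath classes and on hypothetical markup (the src value is invented). A single-quoted Groovy string is used so the double quotes that Firebug emits need no escaping.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Hypothetical markup; only the id value comes from the question.
def markup = '''<html><body><div>
  <img id="gmi-ResViewSizer_img" src="/hypothetical/preview.jpg"/>
</div></body></html>'''

def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(markup.getBytes('UTF-8')))
def xpath = XPathFactory.newInstance().newXPath()

// Single-quoted Groovy string: the inner double quotes are kept as-is.
def byId = '//*[@id="gmi-ResViewSizer_img"]'

// NODE returns the first match, or null when nothing matches.
def img = xpath.evaluate(byId, doc, XPathConstants.NODE)
def src = img.getAttribute('src')

println src
```

With HtmlUnit, the same expression string would be passed to page.getByXPath(...) instead. Whether the img is actually present in the fetched document with JavaScript disabled is not something this sketch can confirm.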

/html/body/div/div/div/div/img[1]

深海蓝天 2024-10-12 11:21:57

I had the same problem. I solved it when I realized the page contained iframe tags. Try calling:

((HtmlPage)current_page.getFrames()[n].getEnclosedPage()).getFirstByXPath(...

where n is the index of the frame in the page's frame collection. It worked for me!
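
When the right index n is not known in advance, one option is to search the page and every frame in turn. The following is only a sketch: the function name findAcrossFrames is hypothetical, and it is deliberately duck-typed, assuming nothing beyond the getByXPath, getFrames and getEnclosedPage calls already mentioned in this thread.

```groovy
// Collect XPath matches from the page itself and from the page
// enclosed by each of its frames (one level deep only).
def findAcrossFrames(page, String expr) {
    def hits = []
    hits.addAll(page.getByXPath(expr))
    page.getFrames().each { frame ->
        def inner = frame.getEnclosedPage()
        // A frame may enclose a non-HTML page, which has no
        // getByXPath method, so check before calling it.
        if (inner?.respondsTo('getByXPath')) {
            hits.addAll(inner.getByXPath(expr))
        }
    }
    hits
}
```

Called as findAcrossFrames(client.getPage(url), '//h1/a'), it would return whatever nodes match in the top-level page or in any frame, and an empty list if nothing matches anywhere.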

Thanks a lot.
