HtmlUnit and XPath: does DOMNode.getByXPath only work on HtmlPage?
I'm trying to parse a page with links to articles whose important content looks like this:
<div class="article">
  <h1 style="float: none;"><a href="performing-arts">Performing Arts</a></h1>
  <a href="/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp">
    <span class="mth3">
      <span id="wctlMiniTemplate1_ctl00_ctl00_ctl01_WctlPremiumContentIcon1">
      </span>
      EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
    </span>
    <span class="mtp">The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher </span>
  </a>
</div>
Here is a minimal scraping case in Java using HtmlUnit and XPath (imports removed for brevity):
public class MinimalTest {

    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        client.setJavaScriptEnabled(false);
        client.setCssEnabled(false);

        System.out.println("Fetching front page");
        HtmlPage frontPage = client.getPage("http://living.scotsman.com/sectionhome.aspx?sectionID=7063");

        List<ArticleInfo> articleInfos = extractArticleInfo(frontPage);
        for (ArticleInfo info : articleInfos)
        {
            System.out.println("Title: " + info.getTitle());
            System.out.println("Intro: " + info.getFirstPara());
            System.out.println("Link: " + info.getLink());
        }
    }

    @SuppressWarnings("unchecked") // xpath returns List<?>
    private static List<ArticleInfo> extractArticleInfo(HtmlPage frontPage) {
        System.out.println("Extracting article links");
        List<HtmlDivision> articleDivs = (List<HtmlDivision>) frontPage.getByXPath("//div[@class='article']");
        System.out.println(String.format("Found %d articles", articleDivs.size()));

        List<ArticleInfo> articleLinks = new ArrayList<ArticleInfo>(articleDivs.size());
        for (HtmlDivision div : articleDivs) {
            articleLinks.add(ArticleInfo.constructFromArticleDiv(div));
        }
        return articleLinks;
    }

    private static class ArticleInfo {
        private final String title;
        private final String link;
        private final String firstPara;

        public ArticleInfo(final String link, final String title, final String firstPara) {
            this.link = link;
            this.title = title;
            this.firstPara = firstPara;
        }

        public static ArticleInfo constructFromArticleDiv(final HtmlDivision div) {
            String link = ((DomText) div.getFirstByXPath("//a/@href/text()")).asText();
            String title = ((DomText) div.getFirstByXPath("//span[@class='mth3']/text()")).asText();
            String firstPara = ((DomText) div.getFirstByXPath("//span[@class='mtp']/text()")).asText();
            return new ArticleInfo(link, title, firstPara);
        }

        public String getTitle() {
            return title;
        }

        public String getFirstPara() {
            return firstPara;
        }

        public String getLink() {
            return link;
        }
    }
}
Output I expect:
Title: EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
Intro: The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher
Link: http://living.scotsman.com/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp
What I get:
Fetching front page
Extracting article links
Found 24 articles
Exception in thread "main" java.lang.NullPointerException
at com.allthefestivals.app.crawler.MinimalTest$ArticleInfo.constructFromArticleDiv(MinimalTest.java:68)
at com.allthefestivals.app.crawler.MinimalTest.extractArticleInfo(MinimalTest.java:50)
at com.allthefestivals.app.crawler.MinimalTest.main(MinimalTest.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
Calling getByXPath works fine on a HtmlPage but seems to return nothing on any other HtmlElement. What's wrong? Is this a bug or implementation gap in HtmlUnit, or am I missing something subtle about XPath syntax?
Related question whose solution didn't work for me: XPath _relative_ to given element in HTMLUnit/Groovy?
Comments (1)
You've tried to treat an attribute as an element. Try this instead:
Then I got
Also, your ArticleInfo class declares "link" to be a String, then assigns it some (custom?) class. I had to mangle things a bit just to get it to compile.
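To illustrate the point about attributes, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath API rather than HtmlUnit. That substitution is an assumption made so the example runs without extra dependencies, and HtmlUnit's own XPath evaluation may differ in details. The sample markup and the names XPathContextDemo, select and first are made up for this illustration. It shows two things: an attribute node has no text() children, so `@href/text()` selects nothing, and an expression starting with `//` searches from the document root even when it is evaluated against a div, whereas `.//` stays inside that div.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathContextDemo {

    // Hypothetical sample markup: two article divs, loosely modelled on the question's HTML.
    static final String SAMPLE =
        "<html><body>"
        + "<div class='article'><a href='/first.jp'><span class='mtp'>First intro</span></a></div>"
        + "<div class='article'><a href='/second.jp'><span class='mtp'>Second intro</span></a></div>"
        + "</body></html>";

    // Parses a well-formed XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the first node matched by expr relative to context, or null if nothing matches.
    static Node select(Node context, String expr) throws Exception {
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expr, context, XPathConstants.NODE);
    }

    // Returns the text content of the first match, or null if nothing matches.
    static String first(Node context, String expr) throws Exception {
        Node hit = select(context, expr);
        return hit == null ? null : hit.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        Node secondDiv = select(doc, "(//div[@class='article'])[2]");

        // Attributes have no child text nodes, so this selects nothing.
        System.out.println(first(secondDiv, "//a/@href/text()"));          // null

        // Selecting the attribute itself works, but "//" restarts at the
        // document root, so the first div's link comes back.
        System.out.println(first(secondDiv, "//a/@href"));                 // /first.jp

        // A leading "." keeps the search inside the context node.
        System.out.println(first(secondDiv, ".//a/@href"));                // /second.jp
        System.out.println(first(secondDiv, ".//span[@class='mtp']"));     // Second intro
    }
}
```

If HtmlUnit evaluates these expressions the same way, the null from `//a/@href/text()` would explain the NullPointerException in constructFromArticleDiv, and expressions such as `.//a/@href` evaluated against each div would be the ones to try. Note that the result would then be an attribute node rather than a DomText, so the cast in the question would also need to change.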