HtmlUnit and XPath: DOMNode.getByXPath only works on HtmlPage?

Posted on 2024-09-17 20:14:27

I'm trying to parse a page with links to articles whose important content looks like this:

<div class="article">
  <h1 style="float: none;"><a href="performing-arts">Performing Arts</a></h1>
  <a href="/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp">
    <span class="mth3">
      <span id="wctlMiniTemplate1_ctl00_ctl00_ctl01_WctlPremiumContentIcon1">                               
      </span>
      EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
    </span>
    <span class="mtp">The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar,  writes Mark Fisher </span>
  </a>                  
</div>

Here is a minimal scraping case in Java using HtmlUnit and XPath (imports removed for brevity):

public class MinimalTest {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        client.setJavaScriptEnabled(false);
        client.setCssEnabled(false);
        System.out.println("Fetching front page");
        HtmlPage frontPage = client.getPage("http://living.scotsman.com/sectionhome.aspx?sectionID=7063");
        List<ArticleInfo> articleInfos = extractArticleInfo(frontPage);

        for (ArticleInfo info : articleInfos)
        {
            System.out.println("Title: " + info.getTitle());
            System.out.println("Intro: " + info.getFirstPara());
            System.out.println("Link: " + info.getLink());
        }
    }

    @SuppressWarnings("unchecked") // xpath returns List<?>
    private static List<ArticleInfo> extractArticleInfo(HtmlPage frontPage) {
        System.out.println("Extracting article links");
        List<HtmlDivision> articleDivs = (List<HtmlDivision>) frontPage.getByXPath("//div[@class='article']");
        System.out.println(String.format("Found %d articles", articleDivs.size()));
        List<ArticleInfo> articleLinks = new ArrayList<ArticleInfo>(articleDivs.size());
        for (HtmlDivision div : articleDivs) {
            articleLinks.add(ArticleInfo.constructFromArticleDiv(div));
        }
        return articleLinks;
    }

    private static class ArticleInfo {
        private final String title;
        private final String link;
        private final String firstPara;

        public ArticleInfo(final String link, final String title, final String firstPara) {
            this.link = link;
            this.title = title;
            this.firstPara = firstPara;
        }
        public static ArticleInfo constructFromArticleDiv(final HtmlDivision div) {
            String link = ((DomText) div.getFirstByXPath("//a/@href/text()")).asText();
            String title = ((DomText) div.getFirstByXPath("//span[@class='mth3']/text()")).asText();
            String firstPara = ((DomText) div.getFirstByXPath("//span[@class='mtp']/text()")).asText();
            return new ArticleInfo(link, title, firstPara);
        }
        public String getTitle() {
            return title;
        }
        public String getFirstPara() {
            return firstPara;
        }
        public String getLink() {
            return link;
        }
    }  
}

Output I expect:

Title: EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus 
Intro: The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher 
Link: http://living.scotsman.com/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp

What I get:

Fetching front page
Extracting article links
Found 24 articles
Exception in thread "main" java.lang.NullPointerException
    at com.allthefestivals.app.crawler.MinimalTest$ArticleInfo.constructFromArticleDiv(MinimalTest.java:68)
    at com.allthefestivals.app.crawler.MinimalTest.extractArticleInfo(MinimalTest.java:50)
    at com.allthefestivals.app.crawler.MinimalTest.main(MinimalTest.java:30)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)

Calling getByXPath works fine on an HtmlPage but seems to return nothing on any other HtmlElement. What's wrong? Is this a bug or implementation gap in HtmlUnit, or am I missing something subtle about XPath syntax?

Related question whose solution didn't work for me: XPath _relative_ to given element in HTMLUnit/Groovy?

别念他 2024-09-24 20:14:27

You've tried to treat an attribute as an element. Try this instead:

String link = ((DomAttr) div.getFirstByXPath("//a/@href")).getValue();

Then I got

Fetching front page
Extracting article links
Found 24 articles
Title: EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
Intro: The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher
Link: /Register.aspx?ReturnURL=http%3a%2f%2fliving.scotsman.com%2fsectionhome.aspx%3fsectionID%3d7063
...

Also, your ArticleInfo class declares "link" to be a String, then assigns it some (custom?) class. I had to mangle things a bit just to get it to compile.
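
A side note on the Link line in the output above: it is the page's registration link, not the article's. That looks consistent with standard XPath rules, where a path starting with // is evaluated from the document root no matter which node you call it on, while ./ and .// start from the context node. Here is a minimal sketch of both effects (the null from @href/text() and the root-anchored //). It deliberately uses the JDK's built-in javax.xml.xpath rather than HtmlUnit so it runs without extra dependencies, and PAGE is a hypothetical, cut-down stand-in for the real markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathContextDemo {
    // Hypothetical, simplified page: a registration link comes before the
    // article div, as the answer's output suggests. Not the real markup.
    static final String PAGE =
        "<html><body>"
        + "<a href='/Register.aspx'>Register</a>"
        + "<div class='article'>"
        + "<h1><a href='performing-arts'>Performing Arts</a></h1>"
        + "<a href='/performing-arts/review.jp'><span class='mth3'>Title</span></a>"
        + "</div>"
        + "</body></html>";

    /**
     * Evaluates expr with the article div as the context node and returns the
     * text content of the first matching node, or null if nothing matches.
     */
    static String firstMatchFromDiv(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node div = (Node) xpath.evaluate(
                "//div[@class='article']", doc, XPathConstants.NODE);
            Node hit = (Node) xpath.evaluate(expr, div, XPathConstants.NODE);
            return hit == null ? null : hit.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Attribute nodes have no child text nodes, so this matches nothing.
        System.out.println(firstMatchFromDiv("//a/@href/text()")); // null
        // "//" starts at the document root: first link on the whole page.
        System.out.println(firstMatchFromDiv("//a/@href"));        // /Register.aspx
        // ".//" starts at the div: first link anywhere inside it (the h1 link).
        System.out.println(firstMatchFromDiv(".//a/@href"));       // performing-arts
        // "./a" selects only direct children of the div: the article link.
        System.out.println(firstMatchFromDiv("./a/@href"));        // /performing-arts/review.jp
    }
}
```

If HtmlUnit follows the same rules, which the output above suggests, then something like div.getFirstByXPath("./a/@href") should pick up the article link, with ./a/span[@class='mth3'] and ./a/span[@class='mtp'] for the other two fields. I haven't run that against the live page, so treat it as a starting point.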
