Java web scraper project returns null instead of links
I used Maven for the HtmlUnit dependency for the web scraper.
The main issue is that my scraper returns null instead of links. I made an Item class with a setter and a getter.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.List;

public class Scraper {
    private static final String searchUrl = "https://sfbay.craigslist.org/search/sss?query=iphone%208&sort=rel";

    public static void main(String[] args) throws Exception { // getPage can throw IOException
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(false);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setUseInsecureSSL(true);

        HtmlPage page = client.getPage(searchUrl);
        List<HtmlElement> items = page.getByXPath("//li[@class='result-row']");
        for (HtmlElement htmlItem : items) {
            // itemAnchor gets the anchor specified by class result-image gallery
            HtmlAnchor itemAnchor = (HtmlAnchor) htmlItem.getFirstByXPath("//a[@class='result-image gallery']");
            Item item = new Item();
            String link = itemAnchor.getHrefAttribute(); // link is extracted into a String
            item.setUrl(link);
            System.out.println(item.getUrl()); // why don't you work
        }
    }
}
Result:
basically a line of null going down
*note: Putting System.out.println(link) there instead prints one link and repeats that same link on every new line; in this case it is just the link 'https://sfbay.craigslist.org/sby/mob/d/san-jose-iphone-plus-256-gb-black/7482411084.html' going all the way down.
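The repeated link in the note above is characteristic of an absolute XPath expression: per the XPath specification, a path that begins with `//` is evaluated from the document root even when it is handed a context node, so `htmlItem.getFirstByXPath("//a[...]")` matches the same first anchor on every iteration; a relative `.//a[...]` searches only inside the current item. The following is a self-contained sketch of that difference using the JDK's built-in `javax.xml.xpath` engine on a toy document (not HtmlUnit, but the same XPath semantics apply):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XPathContextDemo {

    // Returns the href matched by an absolute ("//a") and a relative (".//a")
    // expression, both evaluated against the SECOND <li> element.
    static String[] resolve() throws Exception {
        String xml = "<ul>"
                + "<li><a href='first.html'>one</a></li>"
                + "<li><a href='second.html'>two</a></li>"
                + "</ul>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();

        NodeList items = (NodeList) xp.evaluate("//li", doc, XPathConstants.NODESET);
        Element secondLi = (Element) items.item(1); // context node, like one loop iteration

        // "//a" ignores the context node and searches from the document root,
        // so it always matches the first anchor in the whole document.
        Element absolute = (Element) xp.evaluate("//a", secondLi, XPathConstants.NODE);
        // ".//a" searches only inside the context node.
        Element relative = (Element) xp.evaluate(".//a", secondLi, XPathConstants.NODE);

        return new String[]{absolute.getAttribute("href"), relative.getAttribute("href")};
    }

    public static void main(String[] args) throws Exception {
        String[] hrefs = resolve();
        System.out.println("//a  -> " + hrefs[0]); // first.html  (same anchor every time)
        System.out.println(".//a -> " + hrefs[1]); // second.html (this item's own anchor)
    }
}
```

Applied to the scraper loop, that means changing the inner lookup to `htmlItem.getFirstByXPath(".//a[@class='result-image gallery']")` so each item yields its own anchor.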
I'm a complete beginner in this cruel world. Any help is useful.
edit: I'm going to include the dependency code here just in case; the code for the Item class likely doesn't need to be here, as it is just a set and a get method outlined by setUrl and getUrl.
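Since the Item class itself isn't shown, the null output may come from it rather than from the scraping: a very common way for a bare setter/getter pair to return null is parameter shadowing in the setter (`url = url;` assigns the parameter to itself and never touches the field). A minimal Item that avoids the pitfall, as an assumption of what the class is meant to look like:

```java
public class Item {
    private String url;

    public void setUrl(String url) {
        // Writing "url = url;" here would assign the parameter to itself
        // (the parameter shadows the field) and leave the field null,
        // so getUrl() would print null. "this.url" is required.
        this.url = url;
    }

    public String getUrl() {
        return url;
    }
}
```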
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.60.0</version>
</dependency>
Comment (1):
This works here, producing something like