Java web scraper project returns null instead of links
I used Maven for the HtmlUnit dependency for the web scraper.
The main issue is that my scraper returns null instead of links. I made an Item class with a setter and a getter.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.List;

public class Scraper {
    private static final String searchUrl = "https://sfbay.craigslist.org/search/sss?query=iphone%208&sort=rel";

    public static void main(String[] args) throws Exception { // getPage can throw IOException
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(false);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setUseInsecureSSL(true);

        HtmlPage page = client.getPage(searchUrl);
        List<HtmlElement> items = page.getByXPath("//li[@class='result-row']");
        for (HtmlElement htmlItem : items) {
            // itemAnchor gets the anchor specified by class result-image gallery
            HtmlAnchor itemAnchor = (HtmlAnchor) htmlItem.getFirstByXPath("//a[@class='result-image gallery']");
            Item item = new Item();
            String link = itemAnchor.getHrefAttribute(); // link is extracted into a String
            item.setUrl(link);
            System.out.println(item.getUrl()); // why don't you work
        }
    }
}
Result:
basically a line of null going down
*note: Putting System.out.println(link) there instead prints one link and repeats that same link on every new line; in this case it is just the link 'https://sfbay.craigslist.org/sby/mob/d/san-jose-iphone-plus-256-gb-black/7482411084.html' going all the way down.
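The repeated link in the note above is characteristic of an absolute XPath expression: per the XPath specification, a path that begins with `//` is evaluated from the document root even when it is handed a context node, so `htmlItem.getFirstByXPath("//a[...]")` matches the same first anchor on every iteration; a relative `.//a[...]` searches only inside the current item. The following is a self-contained sketch of that difference using the JDK's built-in `javax.xml.xpath` engine on a toy document (not HtmlUnit, but the same XPath semantics apply):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XPathContextDemo {

    // Returns the href matched by an absolute ("//a") and a relative (".//a")
    // expression, both evaluated against the SECOND <li> element.
    static String[] resolve() throws Exception {
        String xml = "<ul>"
                + "<li><a href='first.html'>one</a></li>"
                + "<li><a href='second.html'>two</a></li>"
                + "</ul>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();

        NodeList items = (NodeList) xp.evaluate("//li", doc, XPathConstants.NODESET);
        Element secondLi = (Element) items.item(1); // context node, like one loop iteration

        // "//a" ignores the context node and searches from the document root,
        // so it always matches the first anchor in the whole document.
        Element absolute = (Element) xp.evaluate("//a", secondLi, XPathConstants.NODE);
        // ".//a" searches only inside the context node.
        Element relative = (Element) xp.evaluate(".//a", secondLi, XPathConstants.NODE);

        return new String[]{absolute.getAttribute("href"), relative.getAttribute("href")};
    }

    public static void main(String[] args) throws Exception {
        String[] hrefs = resolve();
        System.out.println("//a  -> " + hrefs[0]); // first.html  (same anchor every time)
        System.out.println(".//a -> " + hrefs[1]); // second.html (this item's own anchor)
    }
}
```

Applied to the scraper loop, that means changing the inner lookup to `htmlItem.getFirstByXPath(".//a[@class='result-image gallery']")` so each item yields its own anchor.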
I'm a complete beginner in this cruel world. Any help is useful.
edit: I'm going to include the dependency code here just in case; the code for the Item class likely doesn't need to be here, as it is just a set and a get method outlined by setUrl and getUrl.
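Since the Item class itself isn't shown, the null output may come from it rather than from the scraping: a very common way for a bare setter/getter pair to return null is parameter shadowing in the setter (`url = url;` assigns the parameter to itself and never touches the field). A minimal Item that avoids the pitfall, as an assumption of what the class is meant to look like:

```java
public class Item {
    private String url;

    public void setUrl(String url) {
        // Writing "url = url;" here would assign the parameter to itself
        // (the parameter shadows the field) and leave the field null,
        // so getUrl() would print null. "this.url" is required.
        this.url = url;
    }

    public String getUrl() {
        return url;
    }
}
```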
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.60.0</version>
</dependency>
Comment (1):
This works here, producing something like