Web scraping with Jsoup only works half the time
I've been playing around with the Java Jsoup library lately to get a better understanding of web scraping (pulling data off a website). However, the code I put together only works part of the time. Is the problem in my code, or do some sites have measures in place to stop web scraping?
Here is the class that does all the 'magic':
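    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class HTMLParser {
        private Document d;
        private final String url;
        private String content;

        public HTMLParser(String url) {
            this.url = url;
            connect();
            parse();
            display();
        }

        // Fetch and parse the page; d stays null if the request fails.
        private void connect() {
            try {
                d = Jsoup.connect(url).get();
            } catch (IOException e) {
                // Report the failure instead of silently swallowing it.
                System.err.println("Could not fetch " + url + ": " + e);
            }
        }

        // Extract the visible text of <body>, if the page was loaded.
        private void parse() {
            content = (d == null) ? "" : d.body().text();
        }

        // Print the extracted text.
        private void display() {
            System.out.println(content);
        }
    }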
Comments (3)
You might also run into problems if the site loads its data dynamically, which is common in this age of AJAX. Does Jsoup ignore robots.txt, or can you make it do so?
Ideally, you need to render the page first and THEN scrape it.
This software apparently renders web pages: http://lobobrowser.org/java-browser.jsp
There is also an API, which might let you inspect the web page's structure.
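On the robots.txt question: as far as I know, Jsoup simply sends the request and never looks at robots.txt, so honoring it is up to your own code. Here is a minimal sketch of such a check, assuming you have already downloaded the file's text. RobotsCheck, disallowedForAll and isAllowed are names made up for illustration, and the parser only understands Disallow prefixes in "User-agent: *" groups (no Allow lines, no wildcards, no agent-specific groups).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: a deliberately simplified robots.txt check. */
class RobotsCheck {

    /** Collects the Disallow path prefixes that apply to all user agents ("*"). */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\R")) {
            // Drop comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                inWildcardGroup = (lastWasAgent && inWildcardGroup) || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    rules.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any prefix disallowed for all agents. */
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "/index.html"));   // true
        System.out.println(isAllowed(robots, "/private/data")); // false
    }
}
```

You would call isAllowed with the path you are about to request and skip the fetch when it returns false. A production crawler would want a full parser for the robots.txt rules rather than this prefix match.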
You can use https://github.com/subes/invesdwin-webproxy with its HtmlUnit JavaScript headless browser support to wait for the page to render, load its data, execute its JS, and finish its Ajax magic before you actually do the parsing.
You can web scrape without Jsoup.
Whichever way you fetch the page, the network call has to sit inside a try/catch, or the method has to declare the exception with throws.
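The code that originally went with this answer is not shown, so the following is only a sketch of one way to do it with the JDK alone, assuming Java 11 or later for java.net.http.HttpClient. PlainScraper, fetch and visibleText are made-up names, the User-Agent string and the example.com URL are placeholders, and the regex-based tag stripping is a rough stand-in for a real HTML parser (it does not decode entities, for instance).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical example: fetch a page and print its text without Jsoup. */
class PlainScraper {

    /** Downloads the raw HTML and fails loudly on a non-2xx status. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // Placeholder User-Agent; some sites reject requests without one.
                .header("User-Agent", "Mozilla/5.0 (compatible; ExampleScraper/1.0)")
                .timeout(Duration.ofSeconds(20))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        int status = response.statusCode();
        if (status < 200 || status >= 300) {
            throw new IOException("HTTP " + status + " for " + url);
        }
        return response.body();
    }

    /** Rough tag stripper: removes script/style blocks, then tags, then extra spaces. */
    static String visibleText(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // The checked exceptions are declared here; alternatively wrap the call in try/catch.
    public static void main(String[] args) throws IOException, InterruptedException {
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(visibleText(fetch(url)));
    }
}
```

Because fetch throws on 403, 429 and similar responses, a site that blocks scrapers shows up as an error message instead of an empty result. Like Jsoup, this only sees the HTML the server sends, so content loaded later by JavaScript will still be missing.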