Web scraping with Jsoup only works part of the time

Posted 2024-11-29 06:05:27

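I've been playing around with the Java Jsoup library lately to get a better understanding of web scraping (pulling data off a website). However, the code I managed to put together only works part of the time. Is the issue with my code, or do some sites have measures in place to stop web scraping?

Here is the class that does all the 'magic':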

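import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParser {

    private Document d;
    private String url;
    private String content;

    // Fetches, parses and prints the page as soon as the object is created.
    public HTMLParser(String url) {
        this.url = url;
        connect();
        parse();
        display();
    }

    // Downloads the page and parses it into a Document.
    private void connect() {
        try {
            d = Jsoup.connect(url).get();
        } catch (IOException e) {
            // Note: any IOException (timeout, HTTP error status, and so on)
            // is silently swallowed here, which leaves d null.
        }
    }

    // Extracts the visible text of <body>; throws a NullPointerException
    // if connect() failed and d is still null.
    private void parse() {
        content = d.body().text();
    }

    private void display() {
        System.out.println(content);
    }
}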



爱格式化 2024-12-06 06:05:27

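You may also run into problems if the site loads its data dynamically, especially in this age of AJAX. Does Jsoup ignore robots.txt, or can you make it do so?

Ideally, you would render the page first and then scrape it.

This software apparently renders web pages: http://lobobrowser.org/java-browser.jsp
It also has an API that lets you inspect the structure of the page.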
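
As a rough way to tell whether a page falls into the dynamically loaded category, you can compare how much visible text the raw HTML contains with whether it ships any scripts. The sketch below uses only the JDK and simple regular expressions, so it is a heuristic and not a real HTML parser. The class name `StaticTextCheck`, the method names, and the `minChars` threshold are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class StaticTextCheck {
    // Regex approximations only; a real parser handles far more edge cases.
    private static final Pattern SCRIPT = Pattern.compile("(?is)<script\\b.*?</script>");
    private static final Pattern STYLE = Pattern.compile("(?is)<style\\b.*?</style>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    /** Approximates the visible text of raw HTML by dropping scripts, styles and tags. */
    static String visibleText(String html) {
        String s = SCRIPT.matcher(html).replaceAll(" ");
        s = STYLE.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        return s.replaceAll("\\s+", " ").trim();
    }

    /**
     * Returns true when the page contains a script but fewer than minChars
     * characters of visible text (minChars is a hypothetical threshold).
     */
    static boolean looksScriptRendered(String html, int minChars) {
        boolean hasScript = SCRIPT.matcher(html).find();
        return hasScript && visibleText(html).length() < minChars;
    }

    public static void main(String[] args) {
        String staticPage = "<html><body><h1>News</h1>"
                + "<p>All of the article text is already in the HTML.</p></body></html>";
        String ajaxPage = "<html><body><div id=\"app\"></div>"
                + "<script>fetch('/api/items').then(render);</script></body></html>";
        System.out.println(looksScriptRendered(staticPage, 20)); // false
        System.out.println(looksScriptRendered(ajaxPage, 20));   // true
    }
}
```

If the check returns true for a page where the scraper printed almost nothing, the missing content is probably filled in by JavaScript after the initial load, and a rendering step is likely needed before parsing.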


阪姬 2024-12-06 06:05:27

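You can use https://github.com/subes/invesdwin-webproxy and its HtmlUnit-based headless browser support for JavaScript to wait for the page to render, load its data, execute its JS, and do its Ajax magic before you actually parse it.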


北音执念 2024-12-06 06:05:27

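You can web scrape without Jsoup.

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

public class Trick {
    // The IOException has to be caught or thrown; here it is thrown.
    public static void main(String[] args) throws IOException {
        // Placeholders: fill these in for the page you want to scrape.
        String url = "ANY URL";
        String startMarker = "NAME OF CLASS OR ID";
        int skip = 0; // how many characters to skip, counting from the start marker
        String endMarker = "WHERE YOU WANT IT TO END OR STOP SCRAPING";

        URLConnection con = new URL(url).openConnection();
        String str;
        try (Scanner scanner = new Scanner(con.getInputStream())) {
            // Any delimiter works; "\\A" (an assumption here) reads the whole response as one token.
            scanner.useDelimiter("\\A");
            str = scanner.next();
        }

        // Drop everything before the start marker, plus the skipped characters.
        str = str.substring(str.indexOf(startMarker) + skip);
        // Keep everything up to the end marker.
        String wow = str.substring(0, str.indexOf(endMarker));
        System.out.println(wow);
        // Move past the extracted part so the search can be repeated.
        str = str.substring(str.indexOf(endMarker));
    }
}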
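
One weak spot of the marker approach is that `indexOf` returns -1 when a marker is missing, which makes `substring` throw or cut at the wrong place. A minimal sketch of a safer helper is shown below, using only the JDK. The class name `MarkerExtract`, the method `between`, and the sample HTML are hypothetical and chosen just for illustration.

```java
import java.util.Optional;

final class MarkerExtract {
    /**
     * Returns the text between the first startMarker and the next endMarker,
     * or an empty Optional if either marker is missing.
     */
    static Optional<String> between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        start += startMarker.length(); // begin right after the start marker
        int end = text.indexOf(endMarker, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(start, end));
    }

    public static void main(String[] args) {
        String html = "<div class=\"price\">42.00</div><div class=\"stock\">in stock</div>";
        // prints 42.00
        System.out.println(between(html, "<div class=\"price\">", "</div>").orElse("(not found)"));
        // prints (not found)
        System.out.println(between(html, "<span>", "</span>").orElse("(not found)"));
    }
}
```

Returning an `Optional` makes the "marker not found" case explicit, which helps when a site changes its markup and the scraper would otherwise fail with a `StringIndexOutOfBoundsException`.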

