Web scraping with Jsoup only works half the time

Posted 2024-11-29 06:05:27


I've been playing around with the Java Jsoup library lately in an attempt to get a better understanding of web scraping (pulling data off a website). But it would seem that the code I managed to put together only functions part of the time. Is the issue with my code, or is it possible that certain sites have measures to stop web scraping?

Here is the class that does all the 'magic':

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParser {

    private Document doc;
    private String url;
    private String content;

    public HTMLParser(String url) {
        this.url = url;
        connect();
        parse();
        display();
    }

    private void connect() {
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // An empty catch block here silently swallows network failures,
            // which then surface later as a NullPointerException in parse().
            System.err.println("Failed to fetch " + url + ": " + e.getMessage());
        }
    }

    private void parse() {
        // Guard against connect() having failed.
        if (doc != null) {
            content = doc.body().text();
        }
    }

    private void display() {
        System.out.println(content);
    }
}
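To the question itself: yes, some sites do block clients that look like bots, and Jsoup's default User-Agent is easy to reject. A common first step (a sketch, not a guarantee; the URL and agent string below are placeholders) is to identify as a regular browser, set a timeout, and let HTTP errors surface instead of being swallowed:

```java
import java.io.IOException;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PoliteFetch {
    // Fetch a page while identifying as a regular browser.
    public static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")          // some sites reject Jsoup's default agent
                .referrer("https://www.google.com") // some sites also check the referrer
                .timeout(10_000)                    // milliseconds
                .get();
    }

    public static void main(String[] args) {
        try {
            System.out.println(fetch("https://example.com").title());
        } catch (HttpStatusException e) {
            // Non-2xx responses (403, 429, ...) often mean the site is blocking you.
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
        } catch (IOException e) {
            System.err.println("Network problem: " + e.getMessage());
        }
    }
}
```

Seeing the status code tells you whether intermittent failures are your code or the site pushing back.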

3 Answers

爱格式化 2024-12-06 06:05:27


You might also have a problem if the site loads data dynamically, especially in this age of AJAX. Does Jsoup ignore robots.txt, or can you make it do so?

Ideally you need to render the page, and THEN scrape it.

This software apparently renders web pages: http://lobobrowser.org/java-browser.jsp
And there's certainly an API, which might allow you to look into the webpage's structure.
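On the robots.txt point: Jsoup never consults robots.txt by itself, so honoring it is up to you. A very rough sketch using only the standard library (the host is a placeholder, and `looksDisallowed` is a toy check, not a full robots.txt parser):

```java
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

public class RobotsCheck {
    // Toy check: does a "User-agent: *" group disallow this path?
    // A real crawler should use a proper robots.txt parser.
    static boolean looksDisallowed(String robotsTxt, String path) {
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                applies = line.substring(11).trim().equals("*");
            } else if (applies && lower.startsWith("disallow:")) {
                String rule = line.substring(9).trim();
                if (!rule.isEmpty() && path.startsWith(rule)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // robots.txt always lives at the site root.
        URL robotsUrl = new URL("https://example.com/robots.txt");
        try (Scanner s = new Scanner(robotsUrl.openStream()).useDelimiter("\\Z")) {
            System.out.println(looksDisallowed(s.next(), "/private/"));
        }
    }
}
```

Checking this before scraping keeps you on the polite side even though the library doesn't enforce it.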

阪姬 2024-12-06 06:05:27


You can use https://github.com/subes/invesdwin-webproxy with its HtmlUnit Javascript headless browser support to wait for the page to render/load data/execute JS/do its Ajax magic before actually doing the parsing.
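The render-then-parse idea can be sketched with HtmlUnit directly, too: let its headless browser execute the JavaScript, then hand the resulting DOM to Jsoup. This is an assumption-laden sketch (the URL is a placeholder, and depending on your HtmlUnit version the package is `com.gargoylesoftware.htmlunit` as below, or `org.htmlunit` in newer releases):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenderedScrape {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage("https://example.com");
            client.waitForBackgroundJavaScript(5_000);  // give Ajax calls time to finish
            Document doc = Jsoup.parse(page.asXml());   // parse the rendered DOM, not the raw source
            System.out.println(doc.body().text());
        }
    }
}
```

The key difference from plain `Jsoup.connect(...)` is that the HTML you parse is the post-JavaScript DOM, so dynamically loaded content is present.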

北音执念 2024-12-06 06:05:27


You can web scrape without Jsoup.

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

public class Trick {
    public static void main(String[] args) throws IOException {
        // Network I/O can fail, so either declare IOException (as here)
        // or wrap these calls in a try/catch.
        URLConnection con = new URL("ANY URL").openConnection();
        String str;
        try (Scanner scanner = new Scanner(con.getInputStream())) {
            scanner.useDelimiter("\\Z");  // one possible delimiter: \Z reads the whole stream
            str = scanner.next();
        }

        // Slice the raw HTML by searching for textual markers.
        // The marker strings and the offset are placeholders you fill in.
        str = str.substring(str.indexOf("NAME OF CLASS OR ID")
                + /* number of characters to skip from the left */ 0);
        String wow = str.substring(0, str.indexOf("WHERE YOU WANT IT TO END OR STOP SCRAPING"));
        System.out.println(wow);
    }
}