Page content is loaded with JavaScript and Jsoup can't see it

Posted 2024-12-05 10:49:55

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup?

I can't paste the page code here since it is too long: http://pastebin.com/qw4Rfqgw

Here's the element whose content I need: <div id='tags_list'></div>

I need to get this information in Java, preferably using Jsoup. The element is filled with the help of JavaScript:

<div id="tags_list">
    <a href="/tagsc0t20099.html" style="font-size:14;">разведчик</a>
    <a href="/tagsc0t1879.html" style="font-size:14;">Sr</a>
    <a href="/tagsc0t3140.html" style="font-size:14;">стратегический</a>
</div>

Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Test
{
    public static void main( String[] args )
    {
        try
        {
            Document doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
            Elements tags = doc.select( "#tags_list a" );

            for ( Element tag : tags )
            {
                System.out.println( tag.text() );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }
}

Comments (8)

旧故 2024-12-12 10:49:55

JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.

To get access to that type of content you will need an embedded browser component; there are a number of discussions on SO regarding that kind of component, e.g. Is there a way to embed a browser in Java?

愛放△進行李 2024-12-12 10:49:55

Solved in my case with com.codeborne.phantomjsdriver.
NOTE: this is Groovy code.

pom.xml

        <dependency>
          <groupId>com.codeborne</groupId>
          <artifactId>phantomjsdriver</artifactId>
          <version> <here goes last version> </version>
        </dependency>

PhantomJsUtils.groovy

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

class PhantomJsUtils {
    private static String filePath = 'data/temp/';

    public static Document renderPage(String pageUrl) {
        System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
        WebDriver ghostDriver = new PhantomJSDriver();
        try {
            ghostDriver.get(pageUrl);
            return Jsoup.parse(ghostDriver.getPageSource());
        } finally {
            ghostDriver.quit();
        }
    }

    public static Document renderPage(Document doc) {
        String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
        new File(tmpFileName).write(doc.toString());
        return renderPage(new File(tmpFileName).toURI().toString());
    }
}

ClassInProject.groovy

Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
可爱暴击 2024-12-12 10:49:55

You need to understand what is happening:

  • When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
  • However, most websites include JavaScript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the JavaScript, and thus populate the page. Jsoup is not.

The way to understand this is the following: parsing HTML code is easy. Executing JavaScript code and updating the corresponding HTML code is a lot more complex, and is the work of a browser.

Here are some solutions for this kind of problem:

  1. If you can find out which Ajax calls the JavaScript code is making to load the content, you might be able to use the URLs of those calls with Jsoup (a minimal sketch follows this list). To find them, use the Developer Tools in your browser. But this is not guaranteed to work:

    • the URL might be dynamic, and depend on what is on the page at that time
    • if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
  2. In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with JavaScript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
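
For option 1, a minimal sketch might look like the following. The endpoint URL and its query parameter are hypothetical placeholders, not a real URL of the site in question; substitute whatever request the Network tab of your browser's Developer Tools shows.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AjaxEndpointExample
{
    public static void main( String[] args ) throws Exception
    {
        // Hypothetical endpoint taken from the browser's Network tab; not a real URL of the site.
        String ajaxUrl = "http://www.example.com/ajax/tags?article=32558";

        // ignoreContentType(true) lets Jsoup fetch responses that are not text/html,
        // e.g. JSON or a bare HTML fragment returned by an Ajax endpoint.
        Document fragment = Jsoup.connect( ajaxUrl )
                                 .ignoreContentType( true )
                                 .userAgent( "Mozilla/5.0" )
                                 .get();

        // If the endpoint returns an HTML fragment, the usual selectors work on it.
        for ( Element tag : fragment.select( "#tags_list a" ) )
        {
            System.out.println( tag.text() );
        }
    }
}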

背叛残局 2024-12-12 10:49:55

You can use a combination of JSoup and HtmlUnit to get the page contents after the JavaScript has finished running.

pom.xml

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <!-- 2.70.0 is the last release under this groupId; HtmlUnit 3.x+ moved to groupId/package org.htmlunit -->
    <version>2.70.0</version>
</dependency>

A simple example (loading a page from a local file), from https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit

// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());

// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());

// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
    for (Element col : row.select("td"))
        // print results
        System.out.println(col.ownText());

// clean up resources        
webClient2.close();
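
Applied to the page from the question, the same pattern would look roughly like the sketch below. This is untested; the 5-second wait is an arbitrary choice, and the selector is the one from the question.

// Hedged sketch: HtmlUnit renders the page, Jsoup selects the tag links.
try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setThrowExceptionOnScriptError(false); // the site's scripts need not be error-free
    webClient.getOptions().setCssEnabled(false);

    HtmlPage myPage = webClient.getPage("http://www.bestreferat.ru/referat-32558.html");
    webClient.waitForBackgroundJavaScript(5000); // give the tag-list script time to run

    Document doc = Jsoup.parse(myPage.asXml());
    for (Element tag : doc.select("#tags_list a"))
        System.out.println(tag.text());
}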

A more complex example: load the login page, get the session and CSRF token, then post the login and wait for the home page to finish loading (15 seconds)

import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();

String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");

HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");

WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
            new URL("https://url"),
            "https://referrer");

// Add other cookies/ Session ...

webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// waitForBackgroundJavaScript blocks here (for up to 15 s) rather than configuring a later wait
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);

URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);

requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);

// crude fixed wait: block this thread for up to 15 s so the page's scripts can finish
synchronized (page) {
    try {
        page.wait(15000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}

// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
素食主义者 2024-12-12 10:49:55

It is possible to do this by combining JSoup with another framework that interprets the web page; in my example here I'm using HtmlUnit.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

...

WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);

Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
万人眼中万个我 2024-12-12 10:49:55

In fact there is a "way"! Maybe it is more of "a workaround" than a "way"... The code below checks both for the meta "REFRESH" attribute and for JavaScript redirects. If either of them exists, the RedirectedUrl variable is set, so you know your target. Then you can retrieve the target page and go on...

    String RedirectedUrl = null;
    Elements meta = page.select("html head meta");
    if (meta.attr("http-equiv").equalsIgnoreCase("refresh")) {
        RedirectedUrl = meta.attr("content").split("=")[1];
    } else {
        if (page.toString().contains("window.location.href")) {
            meta = page.select("script");
            for (Element script : meta) {
                String s = script.data();
                if (!s.isEmpty() && s.startsWith("window.location.href")) {
                    int start = s.indexOf("=");
                    int end = s.indexOf(";");
                    if (start > 0 && end > start) {
                        s = s.substring(start + 1, end);
                        s = s.replace("'", "").replace("\"", "");
                        RedirectedUrl = s.trim();
                        break;
                    }
                }
            }
        }
    }

... now retrieve the redirected page again...
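
For completeness, here is a hedged sketch of that last step; it assumes page is the Document from the snippet above and leaves exception handling out:

    if (RedirectedUrl != null) {
        // Resolve a possibly relative target against the URL the original page was loaded from.
        String absolute = new java.net.URL(new java.net.URL(page.baseUri()), RedirectedUrl).toString();

        // Fetch the redirect target and parse it as usual.
        Document target = Jsoup.connect(absolute).get();
        System.out.println(target.title());
    }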
感性不性感 2024-12-12 10:49:55

After specifying a user agent, my problem was solved.

https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
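
For reference, setting the user agent with Jsoup looks like this; the UA string is just an example of a common browser value:

Document doc = Jsoup.connect("http://www.bestreferat.ru/referat-32558.html")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // pretend to be a regular browser
    .get();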

两仪 2024-12-12 10:49:55

Try:

Document doc = Jsoup.connect(url)
    .header("Accept-Encoding", "gzip, deflate")
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
    .maxBodySize(0)
    .timeout(600000)
    .get();
