Render JavaScript and HTML in (any) Java program (and access the rendered DOM tree)?

Asked 2024-08-19 18:08:52


What are the best Java libraries to "fully download any webpage, render the built-in JavaScript(s), and then access the rendered webpage (that is, the DOM tree!) programmatically" and to get the DOM tree back as HTML source?

(Something similar to what Firebug does in the end: it renders the page and I get access to the fully rendered DOM tree, exactly as the page looks in the browser! In contrast, if I click "show source" I only get the original JavaScript source code, which is not what I want. I need access to the rendered page...)

(By rendering I mean only building the DOM tree, not a visual rendering...)

This does not have to be one single library; it's fine to have several libraries that accomplish this together (one downloads, one renders...), but due to the dynamic nature of JavaScript, the JavaScript library will most likely also need some kind of downloader of its own in order to fully render any asynchronous JS...

Background:
In the "good old days", HttpClient (the Apache library) was everything required to build your own very simple crawler. (A lot of crawlers like Nutch or Heritrix are still built around this core principle, mainly focusing on standard HTML parsing, so I can't learn from them.)
My problem is that I need to crawl some websites that rely heavily on JavaScript, and I can't parse them with HttpClient alone, as I definitely need to execute the JavaScript first...
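
For context, here is a minimal sketch of the kind of bare HttpClient crawler mentioned above (assuming the Apache HttpClient 4.x API and a hypothetical target URL). It returns only the raw HTML as served, with no JavaScript executed, which is exactly why it fails on JavaScript-heavy pages:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SimpleFetch {
    public static void main(String[] args) throws Exception {
        // Plain HTTP fetch: returns the HTML exactly as the server sent it,
        // without running any of the page's JavaScript.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://example.com"); // hypothetical URL
            try (CloseableHttpResponse response = client.execute(get)) {
                String rawHtml = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(rawHtml);
            }
        }
    }
}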


桃扇骨 2024-08-26 18:08:52


You can use the JavaFX 2 WebEngine. Download the JavaFX SDK (you may already have it if you installed JDK 7u2 or later) and try the code below.

It will print the HTML with the JavaScript already processed.
You can also uncomment the lines in the middle to see the visual rendering.

import java.io.IOException;

import javafx.application.Application;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

// XMLSerializer and OutputFormat come from Xerces (org.apache.xml.serialize);
// the JDK also ships internal copies under com.sun.org.apache.xml.internal.serialize.
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;

public class WebLauncher extends Application {

    @Override
    public void start(Stage stage) {
        final WebView webView = new WebView();
        final WebEngine webEngine = webView.getEngine();
        webEngine.load("http://stackoverflow.com");
        // Uncomment the next two lines to also show the page in a window (visual rendering):
        //stage.setScene(new Scene(webView));
        //stage.show();

        // Loading happens asynchronously; once the load worker reports 100% done,
        // serialize the fully built DOM (with the JavaScript already executed) to stdout.
        webEngine.getLoadWorker().workDoneProperty().addListener(new ChangeListener<Number>() {
            @Override
            public void changed(ObservableValue<? extends Number> observable, Number oldValue, Number newValue) {
                if (newValue.intValue() == 100 /*percent*/) {
                    try {
                        org.w3c.dom.Document doc = webEngine.getDocument();
                        new XMLSerializer(System.out, new OutputFormat(doc, "UTF-8", true)).serialize(doc);
                    } catch (IOException ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });
    }

    public static void main(String[] args) {
        launch();
    }
}
全部不再 2024-08-26 18:08:52


This is a bit outside of the box, but if you are planning on running your code on a server where you have complete control over your environment, it might work...

Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.

Using the Firefox plugins system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.

From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect ) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.

Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.

Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.

I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:

https://developer.mozilla.org/en/XQuery
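
As a rough illustration of the Java side of that hand-off, here is a minimal sketch; the class name, the method name, and the idea that the plugin passes the serialized DOM as a single string are assumptions made for illustration, not part of the LiveConnect API itself:

// Hypothetical receiver class; a Firefox plugin using LiveConnect would call
// DomReceiver.receiveDom(...) with the serialized DOM it copied out of the page.
public class DomReceiver {

    // The plugin pushes the rendered page's DOM here as one big HTML string.
    public static void receiveDom(String url, String renderedHtml) {
        // Do the required processing directly, or hand the string off
        // to a more elaborate parsing/crawling pipeline.
        System.out.println("Received " + renderedHtml.length()
                + " characters of rendered DOM for " + url);
    }
}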

_失温 2024-08-26 18:08:52


The Selenium library is normally used for testing, but it does give you remote control of most standard browsers (IE, Firefox, etc.) as well as a headless, browser-free mode (using HtmlUnit). Because it is intended for UI verification by page scraping, it may well serve your purposes.

In my experience it can sometimes struggle with very slow JavaScript, but with careful use of "wait" commands you can get quite reliable results.

It also has the benefit that you can actually drive the page, not just scrape it. That means that if you need to perform some actions on the page before you get to the data you want (click the search button, click "next", then scrape), you can code that into the process.

I don't know if you'll be able to get the full DOM in a navigable form from Selenium, but it does provide XPath retrieval for the various parts of the page, which is what you'd normally need for a scraping application.
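
A minimal sketch of this approach, assuming the Selenium 3.x Java bindings, a locally installed Firefox/geckodriver, and a hypothetical URL and element id to wait for:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SeleniumDomDump {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver(); // real browser; executes the page's JavaScript
        try {
            driver.get("http://example.com"); // hypothetical URL
            // Wait until an element produced by the page's JavaScript shows up
            // ("results" is a hypothetical id for the JS-generated content).
            new WebDriverWait(driver, 30)
                    .until(ExpectedConditions.presenceOfElementLocated(By.id("results")));
            // getPageSource() returns the DOM as the browser currently sees it,
            // i.e. after the scripts have run.
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}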

谈下烟灰 2024-08-26 18:08:52


You can use Java, or Groovy with or without Grails, together with WebDriver, Selenium, Spock and Geb. These are intended for testing purposes, but the libraries are still useful for your case.
You can implement a crawler that doesn't open a new browser window but only drives a runtime instance of one of these browsers.

还给你自由 2024-08-26 18:08:52


You can try JExplorer.
For more information see http://www.teamdev.com/downloads/jexplorer/docs/JExplorer-PGuide.html

You can also try Cobra, see http://lobobrowser.org/cobra.jsp

够运 2024-08-26 18:08:52


I haven't tried this project, but I have seen several implementations for Node.js that include JavaScript DOM manipulation.

https://github.com/tmpvar/jsdom
