Page content is loaded with JavaScript and Jsoup doesn't see it
One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get the JavaScript-generated content when parsing a page with Jsoup?
I can't paste the page code here because it is too long: http://pastebin.com/qw4Rfqgw
Here's the element whose content I need: <div id='tags_list'></div>
I need to get this information in Java, preferably using Jsoup. The element is filled in with the help of JavaScript:
<div id="tags_list">
<a href="/tagsc0t20099.html" style="font-size:14;">разведчик</a>
<a href="/tagsc0t1879.html" style="font-size:14;">Sr</a>
<a href="/tagsc0t3140.html" style="font-size:14;">стратегический</a>
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class Test
{
    public static void main( String[] args )
    {
        try
        {
            Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
            Elements Tags = Doc.select( "#tags_list a" );
            for ( Element Tag : Tags )
            {
                System.out.println( Tag.text() );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }
}
Comments (8)
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by JavaScript after the initial page load.
To get access to that type of content you will need an embedded browser component. There are a number of discussions on SO regarding that kind of component, e.g. "Is there a way to embed a browser in Java?"
In my case this was solved with com.codeborne.phantomjsdriver.
NOTE: it is Groovy code.
pom.xml
PhantomJsUtils.groovy
ClassInProject.groovy
You need to understand what is happening:
The way to understand this is the following: parsing HTML code is easy. Executing JavaScript code and updating the corresponding HTML is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problem:
If you can find out which Ajax calls the JavaScript code makes to load the content, you might be able to use the URLs of these calls with Jsoup. To find them, use the developer tools in your browser. But this is not guaranteed to work:
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with JavaScript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
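For the first option (calling the Ajax URL yourself), a minimal sketch might look like the following. Everything specific in it is an assumption for illustration: the `/tags` endpoint, the class and method names, and the fragment it returns are hypothetical, and a small local server stands in for the real site so the sketch runs offline. It uses only the JDK (Java 11+); in a real project you would pass the fetched body to Jsoup (for example `Jsoup.parseBodyFragment(body)`) instead of the crude regex used here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AjaxEndpointSketch {

    // Hypothetical fragment, standing in for what the page's script would receive.
    static final String FRAGMENT =
            "<a href=\"/tagsc0t1879.html\">Sr</a> <a href=\"/tagsc0t3140.html\">strategic</a>";

    /** Requests the (assumed) Ajax URL directly, the way the page's script would. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)).body();
    }

    /** Crude link-text extraction for the demo only; prefer a real HTML parser such as Jsoup. */
    static List<String> linkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile("<a\\b[^>]*>(.*?)</a>", Pattern.DOTALL).matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Local stand-in for the site's Ajax endpoint (port 0 = any free port).
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/tags", exchange -> {
            byte[] body = FRAGMENT.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/tags";
            System.out.println(linkTexts(fetch(url)));
        } finally {
            server.stop(0);
        }
    }
}
```

Against a real site you would replace the local URL with whatever request shows up in the Network tab of the developer tools, and possibly copy its headers or parameters as well.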
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
Simple example, from https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit
A complex example: load the login page, get the session and CSRF token, then post and wait for the home page to finish loading (15 seconds)
It is possible by combining JSoup with another framework that interprets the web page; in my example here I'm using HtmlUnit.
In fact there is a "way"! Maybe it is more of a workaround than a way. The code below checks for both the meta "REFRESH" attribute and JavaScript redirects. If either of them exists, the RedirectedUrl variable is set, so you know your target. Then you can retrieve the target page and go on.
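The answer's own code is not reproduced here. As a rough sketch of the same idea (not the author's implementation), the check could be done with two regular expressions over the raw HTML. The class and method names are hypothetical, the patterns cover only the most common forms (a meta refresh with `http-equiv` before `content`, and a quoted string assigned to `location` or `location.href`), and real pages may well need more cases.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedirectSketch {

    // Matches e.g. <meta http-equiv="refresh" content="0; url=http://example.com/next.html">
    // Assumes http-equiv appears before content inside the tag.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*content\\s*=\\s*[\"']"
                    + "\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"'>\\s]+)",
            Pattern.CASE_INSENSITIVE);

    // Matches e.g. window.location = '/target.html' or location.href = "/target.html"
    private static final Pattern JS_LOCATION = Pattern.compile(
            "location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns the redirect target found in the HTML, if any (meta refresh is checked first). */
    static Optional<String> findRedirect(String html) {
        Matcher meta = META_REFRESH.matcher(html);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        Matcher js = JS_LOCATION.matcher(html);
        if (js.find()) {
            return Optional.of(js.group(1));
        }
        return Optional.empty();
    }

    /** Resolves a possibly relative target against the URL of the page it was found on. */
    static String resolve(String pageUrl, String target) {
        return URI.create(pageUrl).resolve(target).toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>window.location.href = '/target.html';</script></head></html>";
        Optional<String> redirectedUrl = findRedirect(html);
        // If a target was found, this is the page to fetch next.
        redirectedUrl.ifPresent(target ->
                System.out.println(resolve("http://www.bestreferat.ru/referat-32558.html", target)));
    }
}
```

Note that this only follows redirects. It does not help when the script builds content in place, as with the `tags_list` block in the question.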
After specifying a user agent, my problem was solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
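With Jsoup itself the user agent is set through `Jsoup.connect(url).userAgent(...)`. For illustration, here is the same idea as a small JDK-only sketch (Java 11+) that builds a request carrying a browser-like `User-Agent` header. The class name, method name, and the exact user-agent string are assumptions, and whether a given site behaves differently for it is something you have to try.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class UserAgentSketch {

    // Example browser-like value; any current browser string would do (assumption).
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    /** Builds a GET request that identifies itself as a regular browser. */
    static HttpRequest browserLikeRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", BROWSER_UA)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = browserLikeRequest("http://www.bestreferat.ru/referat-32558.html");
        // Show the header that would be sent with the request.
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```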
Try: