Dumping complex websites with HtmlUnit
I'm trying to dump the whole contents of a certain site using HtmlUnit, but when I do this on one particular (rather intricate) site, I get an essentially empty file: it is not literally empty, but it contains only an empty head tag and an empty body tag.
Here's my code:
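    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    // url, fullOutputPath and dumpString (a String) are defined earlier.
    BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
    HtmlPage page;
    final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
    webClient.setCssEnabled(false);
    webClient.setPopupBlockerEnabled(true);
    webClient.setRedirectEnabled(true);
    // Keep going on script errors and non-2xx responses instead of throwing.
    webClient.setThrowExceptionOnScriptError(false);
    webClient.setThrowExceptionOnFailingStatusCode(false);
    webClient.setUseInsecureSSL(true);
    webClient.setJavaScriptEnabled(true);
    // Load the page and append its serialized DOM to the dump.
    page = webClient.getPage(url);
    dumpString += page.asXml();
    writer.write(dumpString);
    writer.close();
    webClient.closeAllWindows();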
Some people say that I need to introduce a pause in my code, since the page takes a while to load in Google Chrome, but I tried long pauses and the result is still empty.
Thanks in advance.
Comments (1)
Just some ideas...
Retrieving that URL with wget returns a non-trivial HTML file. So does running your code with webClient.setJavaScriptEnabled(false). So it's definitely something to do with the JavaScript in the page.
With JavaScript enabled, I see from the logs that a bunch of JavaScript jobs are being queued up, and I see corresponding script errors that point at a line in the jQuery library.
Maybe those jobs are meant to populate your HTML? So when they fail, the resulting HTML is empty?
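One way to make that comparison less of an eyeball exercise is to check each dump for the "empty head, empty body" shape described in the question. The following is only a minimal sketch: the class name DumpCheck and the method isEmptyShell are made up for illustration, and the regular expressions assume markup roughly like what page.asXml() produces, not arbitrary HTML.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlUnit): detects a dump whose
// <head> and <body> elements are both present but empty.
public class DumpCheck {

    // Matches <tag .../> or <tag ...></tag> with only whitespace inside.
    private static Pattern emptyElement(String tag) {
        return Pattern.compile(
                "<" + tag + "\\b[^>]*/>|<" + tag + "\\b[^>]*>\\s*</" + tag + "\\s*>",
                Pattern.CASE_INSENSITIVE);
    }

    private static final Pattern EMPTY_HEAD = emptyElement("head");
    private static final Pattern EMPTY_BODY = emptyElement("body");

    // True when the markup contains an empty head AND an empty body.
    public static boolean isEmptyShell(String markup) {
        return EMPTY_HEAD.matcher(markup).find()
                && EMPTY_BODY.matcher(markup).find();
    }
}
```

Calling DumpCheck.isEmptyShell(page.asXml()) once with JavaScript enabled and once with it disabled would then show whether the scripts are what empties the page.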
The jQuery error itself looks strange, as HtmlUnit usually has no issues with jQuery. I suspect the problem is in the page code that calls into that particular line of the jQuery library.
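If wget is not at hand, the same JavaScript-free baseline can be fetched from plain Java. This is a rough sketch assuming Java 11 or later; the class RawFetch and its method names are invented for this example, and the 30-second timeout is an arbitrary choice. Unlike the setUseInsecureSSL(true) setup in the question, this client does not accept invalid certificates.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: downloads a page the way wget does,
// i.e. without executing any of its scripts.
public class RawFetch {

    // Builds a plain GET request for the given URL.
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30)) // arbitrary timeout
                .GET()
                .build();
    }

    // Returns the response body exactly as the server sent it.
    public static String fetch(String url)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

If RawFetch.fetch(url) returns real markup while the HtmlUnit dump with JavaScript enabled is an empty shell, that points at the page's scripts and not at the download itself.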