使用 htmlunit 构建复杂的网站

发布于 2024-12-07 23:18:05 字数 1251 浏览 3 评论 0原文

我正在尝试使用 HTMLUnit 转储某个网站的全部内容,但是当我尝试在某个(相当复杂的)网站中执行此操作时,我得到一个空文件(本身不是一个空文件,但它有一个空文件) head 标签,一个空的 body 标签,仅此而已)。

该网站是 https://www.abcdin.cl/abcdin/abcdin.nsf#https://www.abcdin.cl/abcdin/abcdin.nsf/linea?openpage&cat=Audio&cattxt=TV% 20y%20Audio&catpos=03&linea=LCD&lineatxt=LCD%20&

这是我的代码:

BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
page = webClient.getPage(url);
dumpString += page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();

有人说我需要在代码中引入暂停,因为在 Google Chrome 中加载页面需要一段时间,但我设置了很长的暂停,但它不起作用。

提前致谢。

I'm trying to dump the whole contents of a certain site using HTMLUnit, but when I try to do this in a certain (rather intrincate) site, I get an empty file (not an empty file per se, but it has an empty head tag, an empty body tag and that's it).

The site is https://www.abcdin.cl/abcdin/abcdin.nsf#https://www.abcdin.cl/abcdin/abcdin.nsf/linea?openpage&cat=Audio&cattxt=TV%20y%20Audio&catpos=03&linea=LCD&lineatxt=LCD%20&

And here's my code:

BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
page = webClient.getPage(url);
dumpString += page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();

Some people say that I need to introduce a pause in my code, since the page takes a while to load in Google Chrome, but I set long pauses and it doesn't work.

Thanks in advanced.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

贱人配狗天长地久 2024-12-14 23:18:05

只是一些想法...

使用 wget 检索该 URL 将返回一个重要的 HTML 文件。同样,使用 webClient.setJavaScriptEnabled(false) 运行代码。所以这肯定与页面中的Javascript有关。

启用 Javascript 后,我​​从日志中看到一堆 Javascript 作业正在排队,并且我看到类似这样的相应错误:

EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
at     
com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)

也许这些作业是为了填充您的 HTML?那么当它们失败时,生成的 HTML 是空的?

该错误看起来很奇怪,因为 HtmlUnit 通常对 JQuery 没有问题。我怀疑问题出在调用 JQuery 库的特定行的代码上。

Just some ideas...

Retrieving that URL with wget returns a non-trivial HTML file. Likewise running your code with webClient.setJavaScriptEnabled(false). So it's definitely something to do with the Javascript in the page.

With Javascript enabled, I see from the logs that a bunch of Javascript jobs are being queued up, and I get see corresponding errors like this:

EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
at     
com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)

Maybe those jobs are meant to populate your HTML? So when they fail, the resulting HTML is empty?

The error looks strange, as HtmlUnit usually has no issues with JQuery. I suspect the issue is with the code calling that particular line of the JQuery library.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文