使用 htmlunit 构建复杂的网站

发布于 2024-12-07 23:18:05 字数 1251 浏览 3 评论 0原文

我正在尝试使用 HTMLUnit 转储某个网站的全部内容，但是当我尝试在某个（相当复杂的）网站中执行此操作时，我得到一个空文件（本身不是一个空文件，但它有一个空文件） head 标签，一个空的 body 标签，仅此而已）。

该网站是 https://www.abcdin.cl/abcdin/abcdin.nsf#https://www.abcdin.cl/abcdin/abcdin.nsf/linea?openpage&cat=Audio&cattxt=TV% 20y%20Audio&catpos=03&linea=LCD&lineatxt=LCD%20&

这是我的代码：

BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
page = webClient.getPage(url);
dumpString += page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();

有人说我需要在代码中引入暂停，因为在 Google Chrome 中加载页面需要一段时间，但我设置了很长的暂停，但它不起作用。

提前致谢。

原文

I'm trying to dump the whole contents of a certain site using HTMLUnit, but when I try to do this in a certain (rather intrincate) site, I get an empty file (not an empty file per se, but it has an empty head tag, an empty body tag and that's it).

The site is https://www.abcdin.cl/abcdin/abcdin.nsf#https://www.abcdin.cl/abcdin/abcdin.nsf/linea?openpage&cat=Audio&cattxt=TV%20y%20Audio&catpos=03&linea=LCD&lineatxt=LCD%20&

And here's my code:

BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
page = webClient.getPage(url);
dumpString += page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();

Some people say that I need to introduce a pause in my code, since the page takes a while to load in Google Chrome, but I set long pauses and it doesn't work.

Thanks in advanced.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

贱人配狗天长地久 2024-12-14 23:18:05

只是一些想法...

使用 wget 检索该 URL 将返回一个重要的 HTML 文件。同样，使用 webClient.setJavaScriptEnabled(false) 运行代码。所以这肯定与页面中的Javascript有关。

启用 Javascript 后，我从日志中看到一堆 Javascript 作业正在排队，并且我看到类似这样的相应错误：

EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
at     
com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)

也许这些作业是为了填充您的 HTML？那么当它们失败时，生成的 HTML 是空的？

该错误看起来很奇怪，因为 HtmlUnit 通常对 JQuery 没有问题。我怀疑问题出在调用 JQuery 库的特定行的代码上。

Just some ideas...

Retrieving that URL with wget returns a non-trivial HTML file. Likewise running your code with webClient.setJavaScriptEnabled(false). So it's definitely something to do with the Javascript in the page.

With Javascript enabled, I see from the logs that a bunch of Javascript jobs are being queued up, and I get see corresponding errors like this:

EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
at     
com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)

Maybe those jobs are meant to populate your HTML? So when they fail, the resulting HTML is empty?

The error looks strange, as HtmlUnit usually has no issues with JQuery. I suspect the issue is with the code calling that particular line of the JQuery library.

回复收藏 0 原文

~没有更多了~