Extremely simple code not working in HtmlUnit
I'm working with HtmlUnit 2.9 (the stable version that was released this month). Do you have any idea why the following code is not working?
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;

public class Main {
    public static void main(String[] args) {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setCssEnabled(true);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setThrowExceptionOnFailingStatusCode(false);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setRedirectEnabled(false);
        webClient.setAppletEnabled(false);
        webClient.setJavaScriptEnabled(false);
        webClient.setPopupBlockerEnabled(true);
        webClient.setTimeout(60000);
        webClient.setPrintContentOnFailingStatusCode(false);
        System.out.println("This is printed on screen");
        try {
            webClient.getPage("http://www.2cash.info/index.php");
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("This is NEVER printed on screen");
    }
}
I'm also adding the result of jstack. Notice I've marked a section that gets repeated constantly:
2011-08-26 03:15:45
Full thread dump Java HotSpot(TM) Server VM (20.1-b02 mixed mode):
"Attach Listener" daemon prio=10 tid=0x09520400 nid=0x5363 waiting on condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"JS executor for com.gargoylesoftware.htmlunit.WebClient@a7c45e" daemon prio=10 tid=0x6feb7400 nid=0x5356 waiting on condition [0x6fcfe000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutor.run(JavaScriptExecutor.java:166)
at java.lang.Thread.run(Thread.java:662)
"Low Memory Detector" daemon prio=10 tid=0x70204c00 nid=0x5352 runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x70202800 nid=0x5351 runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x70200800 nid=0x5350 waiting on condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x09514c00 nid=0x534f runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x09503400 nid=0x534e in Object.wait() [0x70798000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x09501c00 nid=0x534d in Object.wait() [0x707e9000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x7675cc58> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x7675cc58> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x09482400 nid=0x5349 runnable [0xb6c34000]
java.lang.Thread.State: RUNNABLE
at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getSlot(ScriptableObject.java:2603)
at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.defineProperty(ScriptableObject.java:1699)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureConstantsPropertiesAndFunctions(JavaScriptEngine.java:350)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureClass(JavaScriptEngine.java:330)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.init(JavaScriptEngine.java:199)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.access$000(JavaScriptEngine.java:79)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$1.run(JavaScriptEngine.java:146)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.initialize(JavaScriptEngine.java:157)
at com.gargoylesoftware.htmlunit.WebClient.initialize(WebClient.java:1141)
at com.gargoylesoftware.htmlunit.WebWindowImpl.setEnclosedPage(WebWindowImpl.java:109)
at com.gargoylesoftware.htmlunit.html.FrameWindow.setEnclosedPage(FrameWindow.java:102)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:200)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
at com.gargoylesoftware.htmlunit.html.BaseFrame.<init>(BaseFrame.java:73)
at com.gargoylesoftware.htmlunit.html.HtmlInlineFrame.<init>(HtmlInlineFrame.java:46)
at com.gargoylesoftware.htmlunit.html.DefaultElementFactory.createElementNS(DefaultElementFactory.java:288)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.startElement(HTMLParser.java:506)
at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1136)
at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:742)
at org.cyberneko.html.filters.DefaultFilter.startElement(DefaultFilter.java:136)
at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:278)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2652)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2022)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:908)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:789)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:225)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
<THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>
at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPageIfPossible(BaseFrame.java:149)
at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPage(BaseFrame.java:99)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1760)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:194)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
</THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
at main.Main.<init>(Main.java:42)
at main.Main.main(Main.java:23)
"VM Thread" prio=10 tid=0x094fe000 nid=0x534c runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x09489800 nid=0x534a runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x0948ac00 nid=0x534b runnable
"VM Periodic Task Thread" prio=10 tid=0x70207000 nid=0x5353 waiting on condition
JNI global references: 1234
I think there is some kind of loop regarding the automatic loading of frames. If that is the case, is there any way to disable that behaviour to break the loop?
Thanks in advance!
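The suspected loop can be illustrated with a toy model. The sketch below is not HtmlUnit code: it assumes the page embeds itself (directly or through another page) as a frame, which the repeated loadFrames/getPage section of the stack trace suggests but does not prove. The class FrameLoopModel, the method loadGuarded and the page names are hypothetical. Without the ancestor check, the recursion would never end for such a site.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of recursive frame loading (hypothetical, not HtmlUnit's implementation).
class FrameLoopModel {

    /**
     * Loads url and, recursively, the frames it embeds. A frame whose URL is
     * already being loaded further up the chain is skipped, which breaks the loop.
     * The site map goes from a page URL to the URLs of the frames it embeds.
     */
    public static List<String> loadGuarded(Map<String, List<String>> site, String url) {
        List<String> loaded = new ArrayList<>();
        load(site, url, new ArrayDeque<>(), loaded);
        return loaded;
    }

    private static void load(Map<String, List<String>> site, String url,
                             Deque<String> ancestors, List<String> loaded) {
        loaded.add(url);
        ancestors.push(url);
        for (String frameUrl : site.getOrDefault(url, Collections.emptyList())) {
            if (ancestors.contains(frameUrl)) {
                continue; // would re-enter a page that is still loading: the loop
            }
            load(site, frameUrl, ancestors, loaded);
        }
        ancestors.pop();
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        // Assumption for the demo: index.php contains an iframe pointing at itself.
        site.put("index.php", Arrays.asList("banner.php", "index.php"));
        site.put("banner.php", Collections.emptyList());
        System.out.println(loadGuarded(site, "index.php")); // prints [index.php, banner.php]
    }
}
```

A loader without the ancestor check would call itself for index.php again and again, which matches the shape of the repeated stack frames above.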
Comments (2)
Well, although it is a horrible solution (a workaround, actually...), I finally decided to disable the automatic loading of frames in HtmlUnit, as advised by one of the HtmlUnit developers. This is what I did in detail:
1. Emptied the body of the loadFrames() method (the method body only, not the declaration) of the HtmlPage class located in htmlunit-2.9/src/main/java/com/gargoylesoftware/htmlunit/html
2. Ran mvn -Dmaven.test.skip=true clean compile package
3. Took the new htmlunit-2.9.jar located in htmlunit-2.9/artifacts and replaced the current htmlunit-2.9.jar library file with it
You know how my original code was (look at the question). That would download all frames and iframes from a page. With the patched library, you can get a page with frames and load just the frames you want.
After this library change, the content of each frame will be empty once the getPage() method finishes. Notice it won't be null; it looks like it just returns an empty frame. What we need to do is manually download the content of the frames we are interested in, which is why a second getPage() call is needed for each of them.
Well, this is how I managed to selectively download frames and iframes with HtmlUnit. Any ideas on how to improve this will be appreciated. Anyway, I hope some way to disable the loading of frames will be added to HtmlUnit itself in the future, maybe a method such as getPage(URL url, boolean downloadFrames) or something.
Hope this helps someone out there!
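The selective second fetch described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answer's original code: SelectiveFrames and fetchWanted are hypothetical names, the example.com URLs are placeholders, and the fetch function is a stand-in for a second webClient.getPage(url) call so that the sketch runs without HtmlUnit on the classpath.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical helper: picks which frames to download after the top-level page is loaded.
class SelectiveFrames {

    /**
     * Resolves every frame src against the URL of the enclosing page and fetches
     * only the frames accepted by the wanted filter.
     * fetch stands in for the second getPage() call and returns the frame content.
     */
    public static Map<String, String> fetchWanted(String pageUrl, List<String> frameSrcs,
                                                  Predicate<String> wanted,
                                                  Function<String, String> fetch) {
        URI base = URI.create(pageUrl);
        Map<String, String> contentByUrl = new LinkedHashMap<>();
        for (String src : frameSrcs) {
            String absolute = base.resolve(src).toString(); // relative src -> absolute URL
            if (wanted.test(absolute)) {
                contentByUrl.put(absolute, fetch.apply(absolute));
            }
        }
        return contentByUrl;
    }

    public static void main(String[] args) {
        // Placeholder data: the src attributes would come from the frame elements of the page.
        List<String> srcs = Arrays.asList("menu.php", "ads/banner.php");
        Map<String, String> frames = fetchWanted(
                "http://www.example.com/index.php",
                srcs,
                url -> !url.contains("/ads/"),          // skip frames we do not care about
                url -> "content of " + url);            // stand-in for webClient.getPage(url)
        System.out.println(frames.keySet());
    }
}
```

With the patched jar, the fetch function would wrap the real getPage() call, and the filter decides which frames are worth the extra request.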
When I open this site in my browser, it never finishes loading the page. This might also be why HtmlUnit crashes. Tested with Chrome and FF.
Try loading a simpler site instead; that should tell you whether this crash is site-dependent.
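To compare sites as suggested without the test program itself hanging, one option is to give each load a hard time limit. The sketch below is not HtmlUnit code: LoadProbe and finishesWithin are hypothetical names, and the two lambdas in main are stand-ins for webClient.getPage(...) calls against a simple site and against the problematic one.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for checking whether a page load ever finishes.
class LoadProbe {

    /**
     * Runs the task on a daemon thread and reports whether it finished
     * (normally or with an exception) within timeoutMillis.
     */
    public static <T> boolean finishesWithin(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "load-probe");
            thread.setDaemon(true); // a stuck load must not keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true; // the task ended, although with an error
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Only interrupts the worker; a thread spinning in a loop may keep running.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for "load a simple site" and "load the problematic site".
        boolean simple = finishesWithin(() -> "simple page", 1000);
        boolean problem = finishesWithin(() -> { Thread.sleep(60_000); return "never"; }, 200);
        System.out.println("simple site finished: " + simple);
        System.out.println("problem site finished: " + problem);
    }
}
```

If the simple site finishes and the problematic one does not, the hang is most likely tied to that site's content rather than to the client configuration.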