HtmlUnit: cannot retrieve pages after downloading a file
I'm having a weird problem with HtmlUnit in Java. I'm using it to download some data from a website, and the process goes something like this:
1. Log in
2. For each element (car):
    3. Search for the car
    4. Download a zip file from a link
The code:
Creation of the webclient:
webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
webClient.setJavaScriptEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
DefaultCredentialsProvider provider = new DefaultCredentialsProvider();
provider.addCredentials(USERNAME, PASSWORD);
webClient.setCredentialsProvider(provider);
webClient.setRefreshHandler(new ImmediateRefreshHandler());
Login:
public void login() throws IOException
{
    page = (HtmlPage) webClient.getPage(URL);
    HtmlForm form = page.getFormByName("formLogin");
    String user = USERNAME;
    String password = PASSWORD;
    // Enter login and password
    form.getInputByName("LoginSteps$UserName").setValueAttribute(user);
    form.getInputByName("LoginSteps$Password").setValueAttribute(password);
    // Click Login Button
    page = (HtmlPage) form.getInputByName("LoginSteps$LoginButton").click();
    webClient.waitForBackgroundJavaScript(3000);
    // Click on Campa area
    HtmlAnchor link = (HtmlAnchor) page.getElementById("ctl00_linkCampaNoiH");
    page = (HtmlPage) link.click();
    webClient.waitForBackgroundJavaScript(3000);
    System.out.println(page.asText());
}
Search for car in website:
private void searchCar(String _regNumber) throws IOException
{
    // Open search window
    page = page.getElementById("search_gridCampaNoi").click();
    webClient.waitForBackgroundJavaScript(3000);
    // Write plate number
    HtmlInput element = (HtmlInput) page.getElementById("jqg1");
    element.setValueAttribute(_regNumber);
    webClient.waitForBackgroundJavaScript(3000);
    // Click on search
    HtmlAnchor anchor = (HtmlAnchor) page.getByXPath("//*[@id=\"fbox_gridCampaNoi_search\"]").get(0);
    page = anchor.click();
    webClient.waitForBackgroundJavaScript(3000);
    System.out.println(page.asText());
}
Download pdf:
    try
    {
        InputStream is = _link.click().getWebResponse().getContentAsStream();
        File path = new File(new File(DOWNLOAD_PATH), _regNumber);
        if (!path.exists())
        {
            path.mkdir();
        }
        writeToFile(is, new File(path, _regNumber + "_pdfs.zip"));
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
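A side note on the download step: writeToFile is not shown in the snippet above. Here is a minimal sketch of what such a helper might look like, assuming it only copies the response stream to the target file. The class name DownloadFiles, the carDirectory helper and both method bodies are hypothetical, and only the JDK is used.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; not part of the original code.
final class DownloadFiles {

    /** Copies the stream to target, replacing any existing file, then closes the stream. */
    public static void writeToFile(InputStream is, File target) throws IOException {
        try (InputStream in = is) {
            Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Returns base/regNumber, creating any missing directories on the way. */
    public static File carDirectory(File base, String regNumber) throws IOException {
        File dir = new File(base, regNumber);
        // Unlike File.mkdir(), this creates missing parents and throws on failure.
        Files.createDirectories(dir.toPath());
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // Demo with an in-memory stream standing in for the web response.
        File base = Files.createTempDirectory("downloads").toFile();
        File dir = carDirectory(base, "ABC123");
        File zip = new File(dir, "ABC123_pdfs.zip");
        writeToFile(new ByteArrayInputStream(new byte[] {1, 2, 3}), zip);
        System.out.println("Wrote " + zip.length() + " bytes to " + zip);
    }
}
```

One reason to prefer Files.createDirectories over path.mkdir() here: mkdir() only returns false when the parent directory is missing, so a misconfigured DOWNLOAD_PATH would surface later as a confusing write error instead of a clear exception.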
The problem:
The first car works fine and the PDF is downloaded, but as soon as I search for a new car and reach this line:
page = page.getElementById("search_gridCampaNoi").click();
I get this exception:
Exception in thread "main" java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
After debugging, I've realized that the moment I make this call:
InputStream is = _link.click().getWebResponse().getContentAsStream();
the return type of page.getElementById("search_gridCampaNoi").click() changes from HtmlPage to WebResponse, so instead of receiving a new page, I receive the file that I already downloaded again.
A couple of screenshots of the debugger showing this situation:
First call, return type OK:
Second call, return type changed and I no longer receive an HtmlPage:
Thanks in advance!
Comments (1)
Just in case someone encounters the same problem, I found a workaround. Changing the line:
to
seems to do the trick. I'm now having problems when doing several iterations (sometimes it works, sometimes it doesn't), but at least I have something now.