HtmlUnit synchronization issues
I am using HtmlUnit in one of my web projects to screen scrape some code. I am wondering to what extent I need to synchronize the code. Currently I am synchronizing all code where I'm using the WebClient object to retrieve pages (i.e. webClient.getPage(url)). I assume that if webClient.getPage() is not synchronized, then the 'browser' could possibly try to load multiple pages at once (correct me if I'm wrong). To get around this, I'd probably have to open multiple windows, correct?
My question concerns the HtmlPage, HtmlTable, etc. classes. After I retrieve an HtmlPage object, do I need to synchronize reads of that page and of other objects obtained from it (e.g. an HtmlTable), or is the whole page cached in memory? I assume that if it isn't cached, then bad things could happen if the WebClient calls getPage() again while I'm still manipulating the previously returned HtmlPage object.
I'd like to have a Connection class that has synchronized methods controlling calls to the WebClient that will return an HtmlPage and then manipulate the page without having to worry about synchronization. Are there any issues with this?
Example:
    import java.io.IOException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class MyConnection {
        private final WebClient webClient;

        public MyConnection() {
            this.webClient = new WebClient();
            this.webClient.setTimeout(10 * 1000);
            this.webClient.setJavaScriptEnabled(false);
            this.webClient.setCssEnabled(false);
        }

        // Only one thread at a time may load a page through the shared WebClient.
        public synchronized HtmlPage getHtmlPage(String url) throws IOException {
            return webClient.getPage(url);
        }
    }

    public class UseConnectionClass {
        private final MyConnection conn = new MyConnection();

        public HtmlPage getAPage(String url) throws IOException {
            return conn.getHtmlPage(url);
        }
    }

    public class ClientClass {
        public void doSomething() throws IOException {
            UseConnectionClass useConn = new UseConnectionClass();
            HtmlPage page1 = useConn.getAPage("http://foobar1.com/");
            HtmlPage page2 = useConn.getAPage("http://foobar2.com/");
            // do something with page1...
            // do something with page2...
            page1.getElementsByTagName("table");
            page2.getElementsByTagName("table");
            // etc...
        }
    }
EDIT: I know that WebClient is not thread-safe, hence the MyConnection object method getHtmlPage() in my example is synchronized.
Comments (1)
As the javadoc says:
Each thread should have its own WebClient.
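One way to follow that advice is to hold the client in a ThreadLocal, so each thread lazily creates its own instance and reuses it. The following is only a minimal sketch: FakeClient is a hypothetical stand-in for WebClient so the example runs without HtmlUnit on the classpath, and PerThreadClientDemo, client() and distinctClients() are names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadClientDemo {

    // Hypothetical stand-in for HtmlUnit's WebClient (assumption, not the real class).
    static final class FakeClient {
        String fetch(String url) {
            return "page for " + url;
        }
    }

    // Each thread gets its own client, created on first access and reused afterwards.
    private static final ThreadLocal<FakeClient> CLIENT =
            ThreadLocal.withInitial(FakeClient::new);

    // Returns the calling thread's client; no synchronization is needed.
    static FakeClient client() {
        return CLIENT.get();
    }

    // Starts threadCount threads, lets each fetch once, and counts the distinct clients used.
    static int distinctClients(int threadCount) throws InterruptedException {
        Set<FakeClient> seen = ConcurrentHashMap.newKeySet();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                FakeClient c = client();
                c.fetch("http://foobar1.com/");
                seen.add(c);
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return seen.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(distinctClients(4));
    }
}
```

With a real WebClient in place of FakeClient, the synchronized wrapper from the question would no longer be needed, since no client is ever shared between threads. The trade-off is one client (and its cookies and cache) per thread instead of a single shared session.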