Is it possible to open multiple connections to multiple sites using only one thread?
Update
I already use a FixedThreadPool. What happens is that each thread opens one connection to one site. What I want is something asynchronous:
- Send a request to a server
- Go on to the next request without waiting for the first request to complete
- Once a connection is established, notify another thread that it is ready for download.
I think this will speed up execution, because fewer threads will be needed to open the same number of connections as now, or more.
Currently, each thread sits idle while it waits for its connection to be established. With the new approach, the thread would always be doing useful work.
The Question
I want to know if there is a way to open connections to multiple sites with only one thread.
I'm writing a web crawler. I already use one thread per connection, but beyond a certain number of threads this stops helping, because the threads increasingly compete for processor time.
I want to increase the number of pages downloaded. Is it possible to do this? How?
This code opens a connection and does some processing. It is executed by the threads that open connections:
/*
 * Open a connection to a server
 */
boolean openConnection(Link link) throws Exception {
    // set the connection parameters
    HttpURLConnection conn = (HttpURLConnection) new URL(link.getOriginalURL().getURL()).openConnection();
    conn.setRequestProperty("User-Agent", ROBOT_NAME);
    conn.setInstanceFollowRedirects(true);
    conn.setConnectTimeout(READ_TIMEOUT);
    conn.setReadTimeout(READ_TIMEOUT);
    link.setConnection(conn);
    // open the connection (blocks until connected or timed out)
    conn.connect();
    // check the server's answer (blocks until the response headers arrive)
    if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
        return false;
    }
    // analyse the URL we were redirected to
    urlAnalyzer.fillURL(link.getRedirectedURL(), getRedirectedURL(link.getConnection()));
    return true;
}
This starts the connection openers, each one in its own thread:
/*
 * Start the execution of the connection openers
 */
private void executeConnectionOpeners() {
    LOGGER.info("Starting connection openers.");
    /* Execution */
    NameThreadFactory ntf = new NameThreadFactory("Connection Opener");
    crawlerOpenerExecutor = Executors.newFixedThreadPool(nOpeners, ntf);
    for (int i = 0; i < nOpeners; i++) {
        crawlerOpenerExecutor.submit(new ConnectionOpener(this));
    }
    /* End of execution */
    LOGGER.info(nOpeners + " connection openers created and running.");
}
Fetching web pages isn't a particularly processor-intensive job: you're going to spend almost all of your time waiting for the network unless you're fetching a lot of small pages over very fast local connections.
Of course, you should look at how many threads it's actually worth using, via benchmarking - you'll probably want to have a fixed set of threads working off a shared producer/consumer queue. (You don't want to create a genuine new thread for each request.)
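A minimal sketch of that arrangement, assuming the URLs are known up front. The class name FetchPool, the method fetchAll and the fetcher parameter are made up for illustration and are not part of the asker's crawler; the fetcher stands in for whatever blocking download code you already have.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical helper: a fixed set of worker threads draining one shared queue.
class FetchPool {

    static <R> List<R> fetchAll(List<String> urls, int nThreads, Function<String, R> fetcher)
            throws InterruptedException {
        // Shared work queue, filled by the producer (here: all at once).
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls);
        // Thread-safe sink for results; note it rejects null results.
        Queue<R> results = new ConcurrentLinkedQueue<>();

        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            pool.submit(() -> {
                String url;
                // Each worker keeps taking URLs until the queue is empty.
                while ((url = queue.poll()) != null) {
                    results.add(fetcher.apply(url));
                }
            });
        }
        pool.shutdown();                              // accept no new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES);   // wait for the workers
        return new ArrayList<>(results);
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in fetcher: just measures the URL length instead of downloading.
        List<Integer> lengths = fetchAll(
                List.of("http://a.example/", "http://b.example/x"), 2, String::length);
        System.out.println(lengths.size() + " URLs processed");
    }
}
```

To benchmark, run the same URL list with different values of nThreads and compare wall-clock time; the point where throughput stops improving is roughly the number of threads worth using.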
Now it should be possible to use only a very few threads if you can perform the fetch asynchronously (potentially with NIO), but I would personally check first whether the "separate threads" approach is actually maxing out your CPU. Separate threads will probably keep the code much simpler than asynchrony would, and if the bottleneck is really the network, going asynchronous leaves you with harder-to-maintain code for little (if any) benefit.
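For comparison, here is a rough sketch of the NIO route: one thread starts several non-blocking connects and waits for all of them on a single Selector. MultiConnector and connectAll are names invented for this example. It only covers connection establishment; sending the HTTP request and parsing the response by hand is where most of the extra complexity would come from. Resolving host names when building an InetSocketAddress still blocks.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: many pending connects, one thread, one Selector.
class MultiConnector {

    /** Returns the targets that accepted a connection before the timeout. */
    static List<InetSocketAddress> connectAll(List<InetSocketAddress> targets, long timeoutMillis)
            throws IOException {
        List<InetSocketAddress> connected = new ArrayList<>();
        List<SocketChannel> opened = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            int pending = 0;
            for (InetSocketAddress target : targets) {
                SocketChannel ch = SocketChannel.open();
                opened.add(ch);
                ch.configureBlocking(false);
                try {
                    if (ch.connect(target)) {
                        connected.add(target);      // completed immediately
                    } else {
                        // Still in progress: ask the selector to tell us when it is done.
                        ch.register(selector, SelectionKey.OP_CONNECT, target);
                        pending++;
                    }
                } catch (IOException e) {
                    // Failed right away (for example, refused); skip this target.
                }
            }
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (pending > 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) break;
                selector.select(remaining);
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    try {
                        if (!ch.finishConnect()) continue;   // not done yet, keep waiting
                        connected.add((InetSocketAddress) key.attachment());
                        // A real crawler would hand ch to a downloader here
                        // instead of closing it at the end of this method.
                    } catch (IOException e) {
                        // Connection attempt failed.
                    }
                    key.cancel();
                    pending--;
                }
            }
        } finally {
            for (SocketChannel ch : opened) {
                try { ch.close(); } catch (IOException ignored) { }
            }
        }
        return connected;
    }
}
```

Calling connectAll with a list of addresses on port 80 would tell you which hosts answered within the timeout, all from the calling thread.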
Check out Java 7's AsynchronousSocketChannel and see if you like it. Basically, you issue a read request, and when bytes are available, it calls your callback. Of course, the callback must be invoked on some thread; you have some options for configuring the threading policy.
I've used xlightweb for a similar purpose, i.e. asynchronous HTTP.
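A small sketch of the callback style, shown for connect rather than read to match the question. AsyncConnector and connectAll are illustrative names. The threading policy chosen here, a channel group with a single callback thread, is just one of the available options. Reads work the same way through read(ByteBuffer, attachment, handler).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.AsynchronousChannelGroup;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: start all connects, let callbacks report completion.
class AsyncConnector {

    /** Returns how many targets connected before the timeout. */
    static int connectAll(List<InetSocketAddress> targets, long timeoutMillis)
            throws IOException, InterruptedException {
        // Threading policy: every callback runs on this group's single thread.
        AsynchronousChannelGroup group =
                AsynchronousChannelGroup.withFixedThreadPool(1, Executors.defaultThreadFactory());
        CountDownLatch done = new CountDownLatch(targets.size());
        AtomicInteger connected = new AtomicInteger();
        try {
            for (InetSocketAddress target : targets) {
                AsynchronousSocketChannel ch = AsynchronousSocketChannel.open(group);
                // connect() returns at once; the handler is invoked later.
                ch.connect(target, ch, new CompletionHandler<Void, AsynchronousSocketChannel>() {
                    @Override
                    public void completed(Void result, AsynchronousSocketChannel c) {
                        connected.incrementAndGet();
                        // A real crawler would start the download on c here.
                        closeQuietly(c);
                        done.countDown();
                    }

                    @Override
                    public void failed(Throwable exc, AsynchronousSocketChannel c) {
                        closeQuietly(c);
                        done.countDown();
                    }
                });
            }
            done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            group.shutdownNow();   // closes any channels that are still pending
        }
        return connected.get();
    }

    private static void closeQuietly(AsynchronousSocketChannel c) {
        try { c.close(); } catch (IOException ignored) { }
    }
}
```

The calling thread never blocks on an individual connection; it only waits on the latch so that this demo can return a count.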