Asynchronous HttpWebRequest
I'm working on a web crawler and I want to use HttpWebRequest. It allows asynchronous operations such as BeginGetResponse, but connecting using HttpWebRequest.Create isn't asynchronous. I want to make about 1,000 connections simultaneously, but with this method (with an extra thread for the asynchronous part) I can't even get 2 connections at once: by the time the second one connects, the first has already finished downloading its content. It's almost as if I were connecting to one page after another instead of simultaneously.
I was wondering if there's a good way to make about 1,000 connections using HttpWebRequest without creating tons of threads or anything...
Thanks in advance.
Edit:
It turned out that it wasn't the HttpWebRequest that was slow and blocking, it was BeginGetResponse. Is it blocking until the request headers are sent? How can I get around this? Should I send asynchronously as well, using BeginGetRequestStream?
Comments (3)
Are all these connections going to the same domain?
Try adding this to your app.config or web.config:
<system.net>
<connectionManagement>
<add address="*" maxconnection="1000" />
</connectionManagement>
</system.net>
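If editing the config file isn't convenient, the same limit can probably be raised from code instead. This is a minimal sketch, assuming .NET Framework, where the per-host default for client applications is typically 2. The `ConnectionLimits` class and its `Raise` method are names made up for illustration; only `ServicePointManager.DefaultConnectionLimit` is a real API.

```csharp
using System.Net;

public static class ConnectionLimits
{
    // Hypothetical helper: raises the per-host connection cap used by
    // HttpWebRequest. Call it once at startup, before any request is made,
    // because an existing ServicePoint keeps the limit it was created with.
    public static void Raise(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
    }
}
```

Calling `ConnectionLimits.Raise(1000);` at the top of the program would mirror the `maxconnection="1000"` setting above.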
I don't think you can make multiple connections on the same thread. You need one thread per connection. But you can modify your design to make it more scalable.
You can make one control thread which does all the heavy lifting (or maybe several of these), and every such control thread spawns several child threads which go out, get the data and put it in some kind of array inside the parent class. Then the control class can recycle the child threads: once a child thread is finished, it gets another "task". The main idea, IMHO, is to separate the crawling from the processing of the retrieved data. Get it, store it and process it later.
Hope this helps in some way :)
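The "get it, store it, process it later" split could be sketched roughly like this. Note that it swaps in a different mechanism from the one described above: tasks with a `SemaphoreSlim` cap instead of dedicated control and child threads. `CrawlSketch`, `FetchAllAsync` and the `fetch` delegate are hypothetical names; `fetch` stands in for whatever download routine is actually used.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class CrawlSketch
{
    // Fetches every URL with at most maxConcurrency downloads in flight and
    // only stores the raw results; processing is left for a later step.
    public static async Task<ConcurrentDictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxConcurrency)
    {
        var store = new ConcurrentDictionary<string, string>();
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();            // wait for a free slot
                try
                {
                    store[url] = await fetch(url); // store now, process later
                }
                finally
                {
                    gate.Release();                // hand the slot to the next URL
                }
            }).ToList();

            await Task.WhenAll(tasks);             // all fetches done (or failed)
        }
        return store;
    }
}
```

The returned dictionary plays the role of the "array inside the parent class": a separate processing step can walk it once the fetching is done.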
There is no reason that this should be blocking. There are some oddities in how asynchronous web requests work which could force your supposedly asynchronous requests to be synchronous. For starters, if you are actually posting data, you must use BeginGetRequestStream (you cannot mix async and sync calls); see: http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.begingetrequeststream.aspx
If I recall correctly, nothing actually happens with WebRequest.Create; it just sets up the object. The request doesn't start until either BeginGetRequestStream or BeginGetResponse is called (depending on whether it's a POST or a GET).
Another big note: in my findings, there is a lot more delay in reading the stream which comes from EndGetResponse than there is in the request itself. You should also use the asynchronous version of read on the stream.
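A rough sketch of that advice for a GET, assuming .NET 4.5 or later so that `Task.Factory.FromAsync` and `Stream.ReadAsync` are available. `AsyncFetch`, `GetAsync` and `ReadAllAsync` are illustrative names, not part of any library, and the UTF-8 decoding is an assumption. As far as I know, the documentation for BeginGetResponse also says that some setup work (DNS resolution, proxy detection, the TCP connect) runs synchronously before the call becomes asynchronous, which may be the blocking described in the question's edit.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Threading.Tasks;

public static class AsyncFetch
{
    // Reads a stream to the end using asynchronous reads only.
    public static async Task<string> ReadAllAsync(Stream stream)
    {
        var buffer = new byte[8192];
        using (var collected = new MemoryStream())
        {
            int read;
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, read);
            }
            // Assumption: the body is UTF-8; a real crawler would check the charset.
            return Encoding.UTF8.GetString(collected.ToArray());
        }
    }

    // GET request: the BeginGetResponse/EndGetResponse pair is wrapped in a
    // Task via FromAsync, then the body is read with ReadAllAsync.
    public static async Task<string> GetAsync(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = await Task.Factory.FromAsync<WebResponse>(
                   request.BeginGetResponse, request.EndGetResponse, null))
        using (var body = response.GetResponseStream())
        {
            return await ReadAllAsync(body);
        }
    }
}
```

If the synchronous setup inside BeginGetResponse turns out to be the bottleneck, one commonly suggested workaround is to start each request from a thread-pool thread, for example `Task.Run(() => AsyncFetch.GetAsync(url))`.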