How to navigate with URLConnection?
My application needs some web scraping functionality. I have a URL object that downloads all the data. But I need to scrape many pages, and I create many URL objects, so I open many connections. How can I optimize this so that I have one connection and just navigate to other pages with it?
Cheers
2 Answers
As far as I can tell, you must have a different URLConnection for each URL (which makes sense, as the underlying network connection must change as well). I seriously doubt that creating this object is your bottleneck; I suspect it is the network time, but without a profile it is hard to know for certain. For a moderate number of pages, I would consider a work queue (say, using an ExecutorService). For a large number of pages, I might even look into a Java version of Map/Reduce.
Edit: For Map/Reduce to be better than a simple work queue, you need to have multiple computers available to do the scraping.
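To make the work-queue idea concrete, here is a minimal sketch of fetching several pages in parallel with an ExecutorService and plain URLConnection; the URLs, pool size, and the fetch helper are placeholders, not part of the original answer:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScrapeQueue {

    // Download one page over its own URLConnection and return the body as text.
    static String fetch(String page) throws Exception {
        URLConnection conn = new URL(page).openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URLs -- substitute the pages you actually need to scrape.
        List<String> pages = Arrays.asList(
                "http://example.com/page1",
                "http://example.com/page2");

        // A small fixed pool; each submitted task opens its own connection.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();
        for (String page : pages) {
            results.add(pool.submit(() -> fetch(page)));
        }

        // Collect the downloaded pages as the tasks finish.
        for (Future<String> f : results) {
            System.out.println("fetched " + f.get().length() + " chars");
        }
        pool.shutdown();
    }
}
```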
You could use Apache HttpComponents; it has a lot of features, including a connection manager that supports concurrent access.
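As a rough sketch, assuming Apache HttpClient 4.x is on the classpath, reusing a single client backed by a pooling connection manager might look like this (the URLs and pool sizes are placeholders):

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class PooledScraper {
    public static void main(String[] args) throws Exception {
        // One pool of reusable connections, shared by every request.
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(20);            // placeholder: total connections in the pool
        cm.setDefaultMaxPerRoute(4);   // placeholder: connections per host

        try (CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build()) {
            // Placeholder URLs -- the same client serves them all.
            String[] pages = {"http://example.com/page1", "http://example.com/page2"};
            for (String page : pages) {
                try (CloseableHttpResponse response = client.execute(new HttpGet(page))) {
                    String body = EntityUtils.toString(response.getEntity());
                    System.out.println(page + " -> " + body.length() + " chars");
                }
            }
        }
    }
}
```

With keep-alive, requests to the same host can reuse a pooled connection instead of opening a fresh one each time, which is about as close as you can get to the "one connection" the question asks for.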