Mirroring websites in Java
I need to mirror some websites from my Java application. I have been looking for an open-source Java library to do this job, but haven't found anything suitable.
Does anybody know of a Java-friendly tool for retrieving entire websites, or must I stick to executing wget from my program?
Thanks a lot.
Comments (2)
The biggest problem I found with this kind of library was the lack of support for CSS parsing. Without it, imported stylesheets, background images and other resources referenced only from CSS are not downloaded when mirroring the website.
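To illustrate what "CSS parsing" has to cover here, below is a minimal sketch that pulls `url(...)` and `@import "..."` references out of a stylesheet and resolves them against the stylesheet's own URL. The class name `CssUrlExtractor` and the regular expression are my own assumptions for illustration, not part of any library mentioned in this thread, and a regex is only an approximation of a real CSS parser (it will also match inside comments, for example).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds resources referenced from CSS text.
public class CssUrlExtractor {
    // Group 1: url(...) with optional quotes. Group 2: @import "..." without url().
    private static final Pattern CSS_REF = Pattern.compile(
        "url\\(\\s*['\"]?([^'\")\\s]+)['\"]?\\s*\\)|@import\\s+['\"]([^'\"]+)['\"]",
        Pattern.CASE_INSENSITIVE);

    public static List<URI> extract(String css, URI base) {
        List<URI> refs = new ArrayList<>();
        Matcher m = CSS_REF.matcher(css);
        while (m.find()) {
            String ref = m.group(1) != null ? m.group(1) : m.group(2);
            if (ref.startsWith("data:")) {
                continue; // inline data, nothing to download
            }
            try {
                // Relative references are resolved against the stylesheet URL.
                refs.add(base.resolve(ref));
            } catch (IllegalArgumentException e) {
                // Not a valid URI; skip it rather than abort the whole mirror.
            }
        }
        return refs;
    }
}
```

A mirroring tool would call something like `extract` on every downloaded stylesheet and queue the returned URIs for download, repeating the step for any stylesheets found through `@import`.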
wget has built-in support for this (at least in recent versions). Although running an external program from Java is not a very clean solution, I would try it first and see if it fits your needs.
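If you go the wget route, a rough sketch with `ProcessBuilder` could look like the following. It assumes `wget` is installed and on the `PATH`; the class name `WgetMirror` and the particular set of options are my own choices for illustration, so adjust them to your case.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the wget command-line tool.
public class WgetMirror {
    // Builds the argument list; kept separate so it can be inspected or logged.
    static List<String> buildCommand(String url, Path targetDir) {
        return List.of(
            "wget",
            "--mirror",            // recursive download with timestamping
            "--page-requisites",   // also fetch images, stylesheets, etc.
            "--convert-links",     // rewrite links so the copy works offline
            "--no-parent",         // never ascend above the start URL
            "--directory-prefix=" + targetDir,
            url);
    }

    // Runs wget and returns its exit code (0 means success).
    static int mirror(String url, Path targetDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(url, targetDir))
            .inheritIO()           // forward wget's progress output to our console
            .start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list (rather than one concatenated string) avoids shell quoting problems when the URL contains characters such as `&` or `?`.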
I would recommend a crawler/spider. Aspider and Sperowider use the Apache HttpClient library (my favourite HTTP library) and crawl through a site by following links. Since they are open source, you should be able to integrate them into your software. Both are currently unmaintained, but Apache HttpClient would still be a good place to start if you want to write your own mirroring tool in Java.
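For a feel of what writing your own tool involves, here is a minimal breadth-first crawler sketch. To keep it dependency-free it uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) as a stand-in for Apache HttpClient, and a simple regex instead of a real HTML parser. The class `MiniCrawler` and all of its method names are hypothetical. It only follows `href` links on the starting host and does not handle CSS, images, robots.txt or saving files to disk.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {
    // Captures the href value up to the closing quote or a fragment marker.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Malformed link; ignore it.
            }
        }
        return links;
    }

    // Crawls breadth-first from start, staying on the same host.
    // fetch returns the page body, or null if the page could not be loaded.
    static Map<URI, String> crawl(URI start, int maxPages, Function<URI, String> fetch) {
        Map<URI, String> pages = new LinkedHashMap<>();
        Deque<URI> queue = new ArrayDeque<>(List.of(start));
        Set<URI> seen = new HashSet<>(List.of(start));
        while (!queue.isEmpty() && pages.size() < maxPages) {
            URI url = queue.poll();
            String body = fetch.apply(url);
            if (body == null) continue;
            pages.put(url, body);
            for (URI link : extractLinks(body, url)) {
                if (Objects.equals(link.getHost(), start.getHost()) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return pages;
    }

    // A fetch function backed by the JDK HTTP client.
    static Function<URI, String> httpFetcher() {
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
        return url -> {
            try {
                HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(url).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
                return response.statusCode() == 200 ? response.body() : null;
            } catch (IOException e) {
                return null;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        };
    }
}
```

Usage would be along the lines of `MiniCrawler.crawl(URI.create("https://example.com/"), 100, MiniCrawler.httpFetcher())`. Taking the fetch step as a parameter makes it easy to swap in Apache HttpClient, or a fake for testing, without touching the crawl loop.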