Downloading a tarball from a repository
I am currently working on a project for scraping source code from SourceForge.
I would like to download the tarball from the code repository.
An example link is given below:
http://wurfl.cvs.sourceforge.net/viewvc/wurfl/?view=tar
The problem I faced while downloading is that I am unable to use conventional APIs such as URLConnection, HttpClient, HtmlUnit, or Jsoup to download the file. The specified link does not contain any filename or extension, which makes the download process even more complicated.
Can you suggest a way to download a set of tarball links, given as parameters, to my disk? Also, I was able to download the file using wget. Is there a way I can do that programmatically in Java on Windows?
Comments (2)
Before you go any further with your efforts, carefully read the Sourceforge Terms of Use page. If you don't understand the ToS, contact Sourceforge and ask them if you are allowed to do what you are proposing.
Your assumption is incorrect.
You CAN use APIs such as the standard HttpURLConnection API or the Apache HttpClient APIs to do this kind of thing. If it is not working, post some details on what is happening when you try these approaches, and maybe we can help you.
(HtmlUnit and Jsoup are probably inappropriate because they target HTML content.)
You can get the source filename and / or content type from the response headers. Refer to the HTTP specifications for details.
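As a rough illustration of the HttpURLConnection route, here is a minimal sketch that saves a link to disk and takes the file name from the Content-Disposition response header when one is present. The class name TarballDownloader, the "downloads" directory, and the fallback file names are hypothetical, and whether the SourceForge link actually sends that header is an assumption that has not been checked here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TarballDownloader {

    // Matches the simple forms filename="name" and filename=name.
    private static final Pattern FILENAME =
            Pattern.compile("filename\\s*=\\s*\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the file name from a Content-Disposition value, or the fallback if none is found.
    static String fileNameFrom(String contentDisposition, String fallback) {
        if (contentDisposition != null) {
            Matcher m = FILENAME.matcher(contentDisposition);
            if (m.find()) {
                String name = m.group(1).trim();
                // Keep only the last path segment so the header cannot point outside the target directory.
                int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
                name = name.substring(slash + 1);
                if (!name.isEmpty()) {
                    return name;
                }
            }
        }
        return fallback;
    }

    // Downloads one link into targetDir and returns the path of the saved file.
    static Path download(String link, Path targetDir, String fallbackName) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(link).toURL().openConnection();
        conn.setInstanceFollowRedirects(true);
        conn.setConnectTimeout(15_000);
        conn.setReadTimeout(60_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + link);
            }
            String name = fileNameFrom(conn.getHeaderField("Content-Disposition"), fallbackName);
            Files.createDirectories(targetDir);
            Path target = targetDir.resolve(name);
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return target;
        } finally {
            conn.disconnect();
        }
    }

    // Each command-line argument is treated as one tarball link.
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("downloads"); // hypothetical output directory
        for (int i = 0; i < args.length; i++) {
            Path saved = download(args[i], dir, "tarball-" + i + ".tar.gz");
            System.out.println(args[i] + " -> " + saved);
        }
    }
}
```

Run it with the tarball links as arguments. If the server sends no usable Content-Disposition header, the file is saved under the fallback name instead.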
If you really DO want to go ahead and possibly violate SourceForge's ToS, then this may help.
You need wget.exe, as you wanted.
Calling it from Java will work as long as you have wget.exe in the same directory as the class file.
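One way to picture the wget route is the sketch below, which starts wget.exe through ProcessBuilder. The class and method names (WgetRunner, wgetCommand, run) are illustrative and not taken from the answer. It also assumes that the working directory passed in is the one that holds wget.exe. The -O option is wget's standard flag for choosing the output file name.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class WgetRunner {

    // Builds the argument list: wget -O <outputFile> <link>.
    static List<String> wgetCommand(String wgetPath, String link, String outputFile) {
        return Arrays.asList(wgetPath, "-O", outputFile, link);
    }

    // Runs wget in workingDir and returns its exit code (0 means success).
    static int run(String wgetPath, String link, String outputFile, File workingDir)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(wgetCommand(wgetPath, link, outputFile));
        pb.directory(workingDir);
        pb.inheritIO(); // show wget's progress output in this console
        return pb.start().waitFor();
    }
}
```

Passing each argument as a separate list element avoids quoting problems with the ? and = characters in the link.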
You may also want to check whether the file actually DOES exist before going any further.
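A check of that kind might look like the sketch below. It reads "the file" as the local file that wget was asked to write, which is an assumption, and the names DownloadCheck and looksDownloaded are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class DownloadCheck {

    // True only if the path is an existing regular file with at least one byte in it.
    static boolean looksDownloaded(Path file) {
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            // The file vanished or could not be read; treat it as not downloaded.
            return false;
        }
    }
}
```

The size check is there because a failed download can leave an empty file behind.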
But I recommend NOT scraping SourceForge, unless it's your own code that you are scraping (I did that once as an updater program). If you do, and my example helps, please kindly don't mention me. =]
Hope I helped!