How to "grab" content from another website
A friend has asked me this, and I couldn't answer.
He asked: I am making this site where you can archive your site...
It works like this: you enter your site, e.g. something.com, and then our site grabs the content on that website (images and so on) and uploads it to our site. Then people can view an exact copy of the site at oursite.com/something.com even if the server hosting something.com is down.
How could he do this (PHP?), and what would the requirements be?
It sounds like you need to create a web crawler. Web crawlers can be written in any language, although I would recommend C++ (with cURL), Java (with URLConnection), or Python (with urllib2). You could probably also hack something together quickly with the curl or wget commands and Bash, although that is probably not the best long-term solution. Also, don't forget that whenever you crawl someone's website you should download, parse, and respect the "robots.txt" file if it is present.
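To make that concrete, here is a minimal single-page sketch in Python 3 (urllib.request is the modern counterpart of the urllib2 module mentioned above); the URL and output folder are placeholders. It saves one page plus the images it references, which is the core of what the friend described. A real archiver would also follow links within the site, rewrite them to point at the local copies, and check robots.txt first, as noted above.

    # Minimal single-page archiver sketch (Python 3).
    import os
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class ImageCollector(HTMLParser):
        """Collect the src attribute of every <img> tag on the page."""
        def __init__(self):
            super().__init__()
            self.images = []

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                src = dict(attrs).get("src")
                if src:
                    self.images.append(src)

    def archive_page(url, out_dir="archive"):
        os.makedirs(out_dir, exist_ok=True)

        # Download and save the HTML itself.
        html = urllib.request.urlopen(url).read()
        with open(os.path.join(out_dir, "index.html"), "wb") as f:
            f.write(html)

        # Parse it and download every referenced image.
        parser = ImageCollector()
        parser.feed(html.decode("utf-8", errors="replace"))
        for src in parser.images:
            img_url = urljoin(url, src)  # resolve relative paths
            name = os.path.basename(urlparse(img_url).path) or "image"
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(urllib.request.urlopen(img_url).read())

    if __name__ == "__main__":
        archive_page("http://example.com/")  # placeholder URL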
Use wget. Either the Linux version or the Windows version from the GnuWin32 package. Get it here.
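For reference, mirroring a whole site with wget usually comes down to a one-line command; example.com is a placeholder, and the exact flags available depend on your wget build, so check man wget:

    wget --mirror --convert-links --page-requisites --no-parent http://example.com/

--convert-links rewrites the saved pages so their links work offline, and --page-requisites pulls in the images and stylesheets each page needs.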