How do you download a website?
Search engine bots crawl the web and download each page they visit for analysis, right?
How exactly do they download a page, and how do they store the pages?
I am asking because I want to run an analysis on a few webpages. I could scrape the pages by visiting their addresses, but wouldn't it make more sense to download them to my computer and work on them from there?
3 Answers
wget --mirror
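If it helps, a fuller invocation along these lines also rewrites links for offline viewing and pulls in the images/CSS each page needs (the URL is just a placeholder):

# mirror the site, rewrite links for local browsing, and grab page assets
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/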
Try HTTrack
About the way they do it:
The indexing starts from a designated starting point (an entrance, if you prefer). From there, the spider recursively follows all hyperlinks down to a given depth.
Search engine spiders work like this as well, but many of them crawl simultaneously, and other factors come into play. For example, a newly created post here on SO will be picked up by Google very quickly, whereas an update on a low-traffic website may not be picked up until days later.
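If you want to approximate that behaviour yourself for a handful of pages, wget can do a depth-limited recursive crawl; a rough sketch, where the depth, delay and URL are only placeholders:

# follow links recursively, 3 levels deep, waiting 1 second between requests
wget --recursive --level=3 --no-parent --wait=1 https://example.com/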
You can use the debugging tools built into Firefox (or Firebug) and Chrome to examine how a page works. As for downloading pages directly, I am not sure. You could try viewing the page source in your browser and then copying and pasting the code.
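Rather than copying and pasting from the view-source window, you could also save the raw HTML straight to a file from the command line; a minimal sketch, where the URL and output filename are just placeholders:

# fetch one page and write its HTML to a local file
wget --output-document=page.html https://example.com/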