What is the best way to save a complete webpage on a Linux server?
I need to archive complete pages, including any linked images etc., on my Linux server. Looking for the best solution. Is there a way to save all assets and then relink them all to work in the same directory?

I've thought about using curl, but I'm unsure of how to do all of this. Also, will I maybe need PHP-DOM?

Is there a way to use Firefox on the server and copy the temp files after the address has been loaded, or something similar?

Any and all input welcome.

Edit:

It seems as though wget is 'not' going to work, as the files need to be rendered. I have Firefox installed on the server; is there a way to load the URL in Firefox, grab the temp files, and then clear the temp files afterwards?
Comments (5)
wget should be sufficient and will grab images/media; there are plenty of options you can feed it.

Note: I believe neither wget nor any other program supports downloading images specified through CSS, so you may need to do that part yourself manually. Here are some useful arguments: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
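For instance, a minimal invocation along those lines might look like this (example.com is just a placeholder, and the exact flags are a suggestion rather than part of the original answer):

```sh
# -p (--page-requisites) downloads images, CSS and other assets the page needs;
# -k (--convert-links) rewrites links so the saved copy works from the local directory
wget -p -k https://example.com/
```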
Use wget with the -E, -k, and -p flags (see the sketch below). Use -E to adjust extensions. Use -k to convert links so the page loads from your local storage. Use -p to download all objects inside the page. Please note that this command does not download other pages hyperlinked from the specified page; it only downloads the objects required to load the specified page properly.
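Assuming https://example.com/page.html as a placeholder URL, the combination of flags described above would look roughly like this:

```sh
# -E adjusts extensions, -k converts links for local viewing, -p fetches all page requisites
wget -E -k -p https://example.com/page.html
```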
wget can do that; for example, a recursive command like the one sketched below will mirror the whole example.com site.

Some interesting options are:
-Dexample.com: do not follow links to other domains.
--html-extension: renames pages with a text/html content type to .html.

Manual: http://www.gnu.org/software/wget/manual/
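A sketch of such a mirror command, assuming http://example.com/ as the target; the -r, -p, and -k flags are additions for a working offline copy and were not spelled out in the answer:

```sh
# -r recurses through the site, -Dexample.com restricts crawling to that domain,
# --html-extension saves text/html pages with an .html suffix,
# -p and -k fetch page requisites and rewrite links for offline viewing
wget -r -Dexample.com --html-extension -p -k http://example.com/
```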
I tried a couple of tools, curl and wget included, but nothing worked up to my expectations. Finally I found a tool that saves a complete webpage (images, scripts, linked stylesheets... everything included). It's written in Rust and is named monolith. Take a look. It packs the images and other scripts/stylesheets into one HTML file.

Example

I can save the webpage https://nodejs.org/en/docs/es6 to a local file es6.html, with all page requisites packed into one file, using the following command:
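Presumably something along these lines, based on monolith's usual command-line form, where -o names the output file:

```sh
monolith https://nodejs.org/en/docs/es6 -o es6.html
```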
If all the content in the web page were static, you could get around this issue with something like the wget invocation sketched below, or some variation thereof.
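For instance (a sketch only; www.example.com is a placeholder):

```sh
# Recursively mirror the site, grab page requisites and rewrite links so the copy works locally
wget -r -p -k http://www.example.com/
```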
Since you also have dynamic pages, you cannot in general archive such a web page using wget or any simple HTTP client. A proper archive needs to incorporate the contents of the backend database and any server-side scripts. That means that the only way to do this properly is to copy the backing server-side files. That includes at least the HTTP server document root and any database files.

EDIT:
As a work-around, you could modify your webpage so that a suitably privileged user could download all the server-side files, as well as a text-mode dump of the backing database (e.g. an SQL dump). You should take extreme care to avoid opening any security holes through this archiving system.
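As a rough illustration of such a dump, assuming a MySQL backend and a standard document root (database name, credentials, and paths are placeholders for your own setup):

```sh
# Dump the backing database to a text-mode SQL file (connection credentials omitted here)
mysqldump --single-transaction mydatabase > /backups/mydatabase.sql
# Bundle the HTTP server document root alongside it
tar czf /backups/docroot.tar.gz /var/www/html
```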
If you are using a virtual hosting provider, most of them provide some kind of Web interface that allows backing up the whole site. If you use an actual server, there are plenty of backup solutions you could install, including a few Web-based ones for hosted sites.