wget: clone a website, linking to directories instead of index.html
I would like to clone a WordPress blog with wget so that I can include it as static content in a web app I am writing.
At the moment I am using the following to clone the site:
wget -rk http://sitename.com
This is working well, but the links in the generated HTML point to the index.html file. I would like those links to point to the directory that contains the file.
For example, for the page http://sitename.com/blog-post-about-cats/
wget generates a directory "blog-post-about-cats" and puts an index.html file in it. Links to that blog post are written as "../blog-post-about-cats/index.html", where I want them to be "../blog-post-about-cats/".
I want this because I think index.html in the URL looks a bit ugly, and these pages are all about presentation.
Any ideas? Is this possible with wget or perhaps a different command line tool?
Thanks.
Comments (2)
I assume wget doesn't do this by default because your local web server might be configured to serve up index pages for directories, rather than index.html. The simplest solution is to postprocess all the fetched HTML files afterwards with a regular expression:
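The command itself is not reproduced in this copy of the answer. A minimal sketch consistent with the description that follows, assuming GNU find, xargs, and sed (in-place -i without a suffix and the case-insensitive i flag are GNU extensions), and wrapped in a hypothetical helper function named strip_index_links so that it can be pointed at any directory, might look like this:

```shell
#!/bin/sh
# strip_index_links is a hypothetical helper name, not part of wget.
# Usage: strip_index_links [DIR] [NAME_PATTERN]
#   DIR           root of the mirrored site (default: current directory)
#   NAME_PATTERN  which files to rewrite (default: *.html)
strip_index_links() {
  dir=${1:-.}
  pattern=${2:-*.html}
  # -print0 / -0 keep file names with spaces intact;
  # -r (GNU xargs) skips running sed when no files match.
  find "$dir" -type f -name "$pattern" -print0 |
    xargs -0 -r sed -i 's|\(href="[^"]*/\)index\.html"|\1"|gi'
}
```

For the mirror in the question this would be called as strip_index_links sitename.com. The sketch only rewrites double-quoted href values that contain a directory part, and sed edits the files in place, so it may be worth keeping a copy of the mirror first.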
If the pages on this site are some other type of file such as .php files, substitute "*.php" or whatever is suitable. The function of the regular expression is to identify strings of the form href="stuff/index.html" and remove the index.html. The xargs and find are used to apply this to all pages, and the "-i" flag to sed makes it modify files in-place. The "gi" flags in the regular expression make it replace all occurrences, and be case-insensitive (since HTML is case-insensitive).
You can let wget do this for you. Try using:
wget --mirror --convert-links --html-extension --wait=1 https://example.com/
Note that in wget 1.12 and later, --html-extension is a deprecated alias for --adjust-extension.
You can find more details here:
https://aawaara.com/post/188869436762/how-to-clone-any-website-in-the-world-and-host-it