Cloning a website with wget so that links point to directories instead of index.html

Posted on 2025-01-01 22:55:56

I would like to clone a WordPress blog with wget so that I can include it as static content in a web app I am writing.

At the moment I am using the following to clone the site:

wget -rk http://sitename.com

This works well, but the links in the generated HTML point to the index.html file. I would like those links to point to the directory that contains the file.

For example, for the page http://sitename.com/blog-post-about-cats/ wget generates a directory "blog-post-about-cats" and puts an index.html file in it. Links to that blog post are written as "../blog-post-about-cats/index.html", whereas I want them to be "../blog-post-about-cats/".

I want this because index.html in the URL looks a bit ugly, and these pages are all about presentation.

Any ideas? Is this possible with wget or perhaps a different command line tool?

Thanks.


我做我的改变 2025-01-08 22:55:56

I assume wget doesn't do this by default because it cannot rely on whatever serves the copy to return index.html for a directory URL: a local web server might be configured to show a directory listing or a different index file instead. The simplest solution is to post-process all the fetched HTML files with a regular expression:

find . -name '*.html' -print0 | xargs -0 sed -E -i 's/href="([^"]*)\/index\.html"/href="\1\/"/gi'

If the pages on this site are some other type of file, such as .php files, substitute '*.php' or whatever is suitable. The regular expression matches strings of the form href="stuff/index.html" and removes the index.html. find and xargs apply it to every page (-print0 and -0 keep file names containing spaces intact), and sed's -i flag modifies the files in place. The command assumes GNU sed. Keep -i separate from the other flags: GNU sed treats anything attached after -i as a backup suffix, so -rie would leave a copy of every file with an e appended to its name. The g and i flags on the substitution replace every occurrence on a line and match case-insensitively (HTML tag and attribute names are case-insensitive).
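
Before rewriting files in place, it may help to try the substitution on a sample line first. The sketch below is only an illustration: it assumes GNU sed, and the function name strip_index and the sample markup are made up for the demonstration, not taken from the site in question. It runs the same expression as a filter on standard input, so nothing on disk is modified.

```shell
# Hypothetical helper: the substitution from the answer, applied to stdin
# instead of to files, so it can be tried out safely (GNU sed assumed).
strip_index() {
  sed -E 's/href="([^"]*)\/index\.html"/href="\1\/"/gi'
}

# Made-up sample markup covering three cases: a normal link,
# an upper-case variant, and a bare index.html with no directory part.
sample='<a href="../blog-post-about-cats/index.html">Cats</a> <a HREF="about/INDEX.HTML">About</a> <a href="index.html">Home</a>'

result=$(printf '%s\n' "$sample" | strip_index)
printf '%s\n' "$result"
```

The first two links lose their index.html, and the upper-case HREF comes out in lower case because the replacement text is literal. The bare href="index.html" is left untouched, since the expression requires a slash before index.html; links of that form would need a separate rule if they should change too.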
