How do I mirror only a section of a website?
I cannot get wget to mirror a section of a website (a folder path below root) - it only seems to work from the website homepage.
I've tried many options - here is one example:
wget -rkp -l3 -np http://somewebsite/subpath/down/here/
While I only want to mirror the content links below that URL, I also need to download the page assets that are not in that path.
It seems to work fine for the homepage (/) but I can't get it going for any sub-folders.
Comments (5)
I use pavuk to accomplish mirrors, as it seemed much better suited to this purpose right from the beginning. You can use something like this:
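The command block from this answer did not survive this copy; as a rough sketch only (pavuk flag behaviour varies between versions, so verify against your man page):

```shell
# Sketch only, not the original answer's command: -dont_leave_dir tells
# pavuk not to climb above the starting directory, so only that subtree
# is mirrored. Placeholder URL; check flags against your pavuk version.
pavuk -dont_leave_dir http://somewebsite/subpath/down/here/
```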
For my use case the --no-parent option didn't quite work.
I was trying to mirror https://www.example.com/section and URLs under it, like https://www.example.com/section/subsection. This can't be done with --no-parent: if you start at /section, it'll download the entire site (the parent directory of /section is /), while if you start at /section/, the site redirects to /section, which is now at the parent level, so wget stops. Fun. Instead, I am using --accept-regex 'https://www.example.com/(section|assets/).*'. This worked. (Although it would also download sectionfoobar, that was acceptable for me, and now we are wandering into regexp territory, which is amply covered elsewhere on SO.)
Check out archivebox.io, it's an open-source, self-hosted tool that creates a local, static, browsable HTML clone of websites (it saves HTML, JS, media files, PDFs, screenshots, static assets, and more).
By default, it only archives the URL you specify, but we're adding a --depth=n flag soon that will let you recursively archive links from the given URL.
Use the --mirror (-m) and --no-parent (-np) options, plus a few other cool ones, like in this example:
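The example command itself is missing from this copy of the answer; a plausible sketch built only from standard wget flags (placeholder URL, not the author's original command):

```shell
# -m  (--mirror)           recursion + timestamping, infinite depth
# -np (--no-parent)        never ascend above the start directory
# -p  (--page-requisites)  also fetch the CSS/JS/images each page needs
# -k  (--convert-links)    rewrite links so the mirror browses locally
# -E  (--adjust-extension) save HTML pages with an .html extension
wget -m -np -p -k -E http://somewebsite/subpath/down/here/
```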
I usually use:
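The command is likewise missing from this copy; a minimal sketch along the same lines, using combined short flags (placeholder URL, hedged reconstruction rather than the original answer):

```shell
# -m mirror, -p page requisites, -E adjust extensions, -k convert links,
# -np stay below the starting path
wget -mpEk -np http://somewebsite/subpath/down/here/
```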