How do I use wget to mirror an HTML file and its accompanying images locally under a specified filename?
I need to grab a URL as HTML along with its accompanying images. The HTML is to be saved with a custom filename (I'm giving it a timestamp in the calling script), and the images need to carry the same timestamps.
The resulting folder should be structured like this:
2012-02-22 06:00:00 UTC.html
2012-02-22 07:00:00 UTC.html
2012-02-22 08:00:00 UTC.html
img1_2012-02-22 06:00:00 UTC.gif
img2_2012-02-22 06:00:00 UTC.gif
img1_2012-02-22 07:00:00 UTC.gif
img2_2012-02-22 07:00:00 UTC.gif
img1_2012-02-22 08:00:00 UTC.gif
img2_2012-02-22 08:00:00 UTC.gif
Essentially this is a mirror that needs to rewrite the image paths in the HTML to local relative paths. I've played around with wget's --directory-prefix and --output-document options with no real success, since it embeds the images in the HTML output file.
Is this doable with stock wget, or is it better to write my own script that pulls each file down and then parses the HTML file, replacing the strings appropriately?
Comments (1)
I did something similar to what you are trying to accomplish here. I'm pretty sure you can rework what I have there to create your program.
Check out 4chan_image_crawler.rb
The program does these things.
Let me know if you have trouble understanding my code.
The key to doing this successfully is knowing how to work with regular expressions and how to use scan from Ruby's String class.
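To make the regex-and-scan idea concrete, here is a minimal sketch of the path-rewriting step. It is not taken from 4chan_image_crawler.rb; the names `IMG_SRC`, `stamped_name`, and `localize_images` are hypothetical, and the regex is a simplification that assumes quoted `src` attributes on plain `<img>` tags (it would also match attributes such as `data-src`). The sketch only works out which images to fetch and what the rewritten HTML looks like; the actual downloading is left out.

```ruby
require "uri"
require "erb"

# Simplified pattern (an assumption, not a full HTML parser):
# group 1 = everything up to the opening quote, group 2 = the src value,
# group 3 = the closing quote.
IMG_SRC = /(<img\b[^>]*?\bsrc\s*=\s*["'])([^"']+)(["'])/i

# Hypothetical helper: "pics/img1.gif" + "2012-02-22 06:00:00 UTC"
# becomes "img1_2012-02-22 06:00:00 UTC.gif".
def stamped_name(src, stamp)
  base = File.basename(URI.parse(src).path)
  ext  = File.extname(base)
  "#{File.basename(base, ext)}_#{stamp}#{ext}"
end

# Returns [rewritten_html, downloads]. downloads maps each absolute image
# URL to the local filename it should be saved under.
def localize_images(html, page_url, stamp)
  downloads = {}

  # String#scan returns one [group1, group2, group3] array per match.
  html.scan(IMG_SRC).map { |_, src, _| src }.uniq.each do |src|
    downloads[URI.join(page_url, src).to_s] = stamped_name(src, stamp)
  end

  # Swap each src for the local name, percent-encoded because the
  # timestamp contains spaces and colons.
  rewritten = html.gsub(IMG_SRC) do
    m = Regexp.last_match
    "#{m[1]}#{ERB::Util.url_encode(stamped_name(m[2], stamp))}#{m[3]}"
  end

  [rewritten, downloads]
end
```

From there, the calling script could write `rewritten` to `"#{stamp}.html"` and fetch each key of `downloads` into its mapped filename, for example with `URI.open` from open-uri or by shelling out to wget with --output-document for one file at a time.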