How to use custom file names for Scrapy image downloads
For my Scrapy project I'm currently using the ImagesPipeline. The downloaded images are stored with the SHA1 hash of their URL as the file name.
How can I store the files using my own custom file names instead?
What if my custom file name needs to contain another scraped field from the same item? For example, use item['desc'] in the file name of the image from item['image_url']. If I understand correctly, that would involve somehow accessing the other item fields from the image pipeline.
Any help will be appreciated.
Comments (6)
This just updates the answer for Scrapy 0.24 (EDITED), where image_key() is deprecated.
In Scrapy 0.12 I solved something like this.
I found my own way in 2017, with Scrapy 1.1.3. Like the code above, you can add the name you want to the Request meta in get_media_requests(), and get it back in file_path() with request.meta.get('yourname', '').
This was the way I solved the problem in Scrapy 0.10.
Check the persist_image method of FSImagesStoreChangeableDirectory. The file name of the downloaded image is in key.
I did a nasty quick hack for that. In my case, I stored the title of the image in my feed, and I had only one entry in image_urls per item, so I wrote a script for it. It renames the image files in the /images/full/ directory using the corresponding title from the item feed, which I had stored as JSON. It's nasty and not recommended, but it is a naive alternative approach.
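The original script is not reproduced here, so the following is only a sketch of what such a rename pass might look like. It assumes the feed is a JSON array of items with 'title' and 'image_urls' fields, and that the files still carry the default names mentioned in the question (SHA1 hash of the URL, with a .jpg extension). The function name and the sanitising regex are made up for illustration.

```python
import hashlib
import json
import re
from pathlib import Path


def rename_images(feed_path, images_dir):
    """Rename <sha1(url)>.jpg files to <title>.jpg based on a JSON item feed."""
    images_dir = Path(images_dir)
    with open(feed_path, encoding="utf-8") as fh:
        items = json.load(fh)  # assumed: a JSON array of item objects

    renamed = []
    for item in items:
        urls = item.get("image_urls") or []
        title = item.get("title", "")
        if not urls or not title:
            continue
        # Default name given by the pipeline: SHA1 of the image URL.
        digest = hashlib.sha1(urls[0].encode("utf-8")).hexdigest()
        old = images_dir / f"{digest}.jpg"
        # Keep only characters that are safe in a file name.
        safe = re.sub(r"[^\w\-]+", "_", title).strip("_")
        new = images_dir / f"{safe}.jpg"
        # Skip missing files and never overwrite an existing one.
        if safe and old.exists() and not new.exists():
            old.rename(new)
            renamed.append(new.name)
    return renamed
```

Called as rename_images('items.json', 'images/full') after the crawl has finished, it returns the list of new file names. Items whose titles collide are left under their hash names.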
I rewrote the code, changing "response." to "request." in the thumb_path definition. Otherwise it won't work, because response is set to None.