Downloading images and storing them in separate files
I want to download images from the web and store them in separate files, named after each image's title. I have developed a scraper that grabs the links to these images, but when I include the FilesPipeline I cannot append .png to each image on download, nor can I change the filename from the SHA1 hash to the name I extracted into title.
Here's what I have so far:
import scrapy
from scrapy_playwright.page import PageCoroutine
from scrapy.item import Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
from bs4 import BeautifulSoup
import json
import re

# assumes DownfilesItem is defined in insta_vm/items.py, as shown further below
from insta_vm.items import DownfilesItem

headers = {
    'Connection': 'keep-alive',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="98", "Google Chrome";v="98"',
    'Accept': '*/*',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    'sec-ch-ua-platform': '"macOS"',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://digital.library.pitt.edu/islandora/object/pitt%3A31735061815696/viewer',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}


class carnapItem(scrapy.Item):
    title = Field(input_processor=MapCompose(str.strip),
                  output_processor=Join())
    id_image = Field(output_processor=TakeFirst())


class carnapSpider(scrapy.Spider):
    name = 'carnap'
    start_urls = [
        f'https://digital.library.pitt.edu/collection/archives-scientific-philosophy?page={page}&islandora_solr_search_navigation=0&f%5B0%5D=mods_relatedItem_host_titleInfo_title_ms%3A%22Rudolf%5C%20Carnap%5C%20Papers%22'
        for page in range(1, 44)
    ]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):
        # follow each search result through to its viewer page
        container = response.xpath("//div[@class='islandora islandora-solr-search-results']/div")
        for data in container:
            href_data = data.xpath('(.//a)[position() mod 5=1]//@href').get()
            href_data = '/viewer#'.join(href_data.split('#'))
            links = response.urljoin(href_data)
            yield response.follow(url=links, callback=self.parse_carnap, headers=headers)

    def parse_carnap(self, response):
        # pull the bookreader JSON blob out of the viewer page's inline script
        soup = BeautifulSoup(response.body, 'lxml')
        for i in range(53, 54, 1):
            java_val = soup.select(f"*[type]:nth-child({i})")
            for b in java_val:
                data_test = b.text[b.text.find('{'):b.text.rfind('}') + 1]
                data_test = json.loads(data_test)
                test = BeautifulSoup(data_test['islandoraInternetArchiveBookReader']['info'], 'lxml')
                title = re.sub('Title', '', test.find('tr', {'class': 'odd'}).text)
                id_no = [str(test.select('.even')[1]).split('>')[4].split('<')[0]]
                page_count = data_test['islandoraInternetArchiveBookReader']['pageCount']
                for id_m in id_no:
                    for pg in range(1, page_count + 1):
                        another_str = f'https://digital.library.pitt.edu/internet_archive_bookreader_get_image_uri/pitt:{id_m}-00{str(pg).zfill(2)}'
                        yield scrapy.Request(
                            url=another_str,
                            method='POST',
                            headers=headers,
                            callback=self.parse_images,
                            cb_kwargs={'title': title},
                        )

    def parse_images(self, response, title):
        # the endpoint returns the image URI as plain text
        file_url = response.text
        item = DownfilesItem()
        item['original_file_name'] = title
        item['file_urls'] = [file_url]
        yield item
Settings:
BOT_NAME = 'insta_vm'
SPIDER_MODULES = ['insta_vm.spiders']
NEWSPIDER_MODULE = 'insta_vm.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 150}
FILES_STORE = "Files"
My items definition won't append the .png extension to the downloaded files:
import scrapy
from itemloaders.processors import MapCompose


class DownfilesItem(scrapy.Item):
    # note: input_processor only runs when the item is populated through an
    # ItemLoader; assigning fields directly (as parse_images does) bypasses it
    file_urls = scrapy.Field(input_processor=MapCompose(lambda x: x + '.png'))
    original_file_name = scrapy.Field()
    files = scrapy.Field()
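The input_processor above never runs: Scrapy only applies field processors through an ItemLoader, and parse_images assigns the fields directly. More importantly, the stock FilesPipeline names every download after the SHA1 hash of its URL regardless of what the item contains. The usual fix is to subclass FilesPipeline and override file_path, which in Scrapy 2.4+ receives the item being processed. A minimal sketch, assuming it lives in insta_vm/pipelines.py (the CustomFilesPipeline name is illustrative, not part of the original project):

from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # name the file after the title the spider extracted instead of
        # the default SHA1 hash, and append the .png extension here
        name = item['original_file_name'].strip().replace('/', '_')
        return f'{name}.png'

It would then replace the stock pipeline in the settings:

ITEM_PIPELINES = {'insta_vm.pipelines.CustomFilesPipeline': 150}

Note that every page of a document shares the same title in the spider above, so a page number probably needs to be worked into the name to keep files from overwriting each other.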
Comments (1)
OH! With Islandora 7 you don't scrape it like that. Use the datastream URL and append the filename to the end.
${DOMAIN}/islandora/object/${PID}/datastream/OBJ/${desired_file_name}.png
Islandora will automatically name the file for you as you download it. What I have done in the past is to use the PID as the file name and fetch the original object.
${DOMAIN}/islandora/object/${PID}/datastream/OBJ/${PID}.png
Make sure the original object is a PNG. If you're not sure, you'll need to query Solr or Fedora to check the original format. Most universities prefer a raw format like TIFF instead of PNG, and you may need to replace the "/OBJ/" part of the URL with a different datastream.
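Applied to the spider in the question, the per-page PIDs it already builds can become datastream URLs directly, with no POST to the bookreader endpoint. A minimal sketch under the answer's assumptions (build_datastream_url is an illustrative helper, and the OBJ datastream is assumed to be a PNG):

DOMAIN = 'https://digital.library.pitt.edu'


def build_datastream_url(pid, file_name):
    # Islandora serves the stored object at .../datastream/OBJ/... and
    # names the download after the final path segment
    return f'{DOMAIN}/islandora/object/{pid}/datastream/OBJ/{file_name}.png'

# e.g. inside parse_carnap, replacing the bookreader POST:
# pid = f'pitt:{id_m}-00{str(pg).zfill(2)}'
# file_url = build_datastream_url(pid, pid.replace(':', '_'))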