Scrapy CrawlSpider crawls but doesn't parse any items
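I'm attempting to collect information about all of the products sold on groceries.aldi.co.uk. I have some experience scraping similar websites and have used CrawlSpider to do so.
When I run the spider it seems to crawl the whole website, but it does not return any items. I've tried several different rule combinations, as I suspect the issue is linked to the rules, but I haven't been able to fix it.
Any help would be really appreciated.
Here's my spider code: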
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from aldiscraper.items import AldiscraperItem
from scrapy.loader import ItemLoader
from datetime import datetime
import re

class AldiSpider(CrawlSpider):
    name = 'aldi'
    start_urls = ['https://groceries.aldi.co.uk/']
    rules = (
        Rule(LinkExtractor(allow='en-GB/', deny=r'/\d\d\d\d\d\d\d\d\d\d\d\d\d')),
        Rule(LinkExtractor(allow=r'/\d\d\d\d\d\d\d\d\d\d\d\d\d'), callback='parse_products')
    )
    custom_settings = {
        'FEED_EXPORT_FIELDS': [
            'prod_id',
            'name',
            'size',
            'price',
            'scrape_date',
        ],
    }

    def parse_products(self, response):
        item = AldiscraperItem()
        item['prod_id'] = response.css('span.sku.small::text').get()
        item['name'] = response.css('h1.my-0::text').get()
        item['size'] = response.css('span.text-black-50.font-weight-bold::text').get()
        item['price'] = response.css('span.product-price.h4.m-0.font-weight-bold::text').get()
        item['scrape_date'] = datetime.now().strftime('%d/%m/%Y')
        yield item
I originally tried to run the spider using the following rules, with the same results:
class AldiSpider(CrawlSpider):
    name = 'aldi'
    start_urls = ['https://groceries.aldi.co.uk/']
    rules = (
        Rule(LinkExtractor(allow='en-GB/', deny='en-GB/p-')),
        Rule(LinkExtractor(allow='en-GB/p-'), callback='parse_products')
    )
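To check the rules in isolation, here is a minimal sketch that applies the same patterns with a plain regex search, which is roughly how LinkExtractor matches its allow and deny patterns against a URL. It assumes the digit pattern is thirteen consecutive `\d`, and the product URL is hypothetical (made up for illustration, not taken from the site). It only approximates the two rule sets above and does not run Scrapy itself.

```python
import re

# Patterns copied from the two rule sets; the digit pattern is assumed
# to be thirteen consecutive \d characters.
FOLLOW_ALLOW = r'en-GB/'
PRODUCT_BY_DIGITS = r'/\d\d\d\d\d\d\d\d\d\d\d\d\d'
PRODUCT_BY_PREFIX = r'en-GB/p-'


def classify(url, product_pattern):
    """Return which rule would pick up `url`: 'product', 'follow' or None."""
    if re.search(product_pattern, url):
        # Denied by the first rule, allowed by the second (callback) rule.
        return 'product'
    if re.search(FOLLOW_ALLOW, url):
        # Allowed by the first rule, which has no callback.
        return 'follow'
    return None


# URLs taken from the crawl log.
crawled = [
    'https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=3',
    'https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks',
    'https://groceries.aldi.co.uk/en-GB/Login',
]
# Hypothetical product URL, used only to exercise the product patterns.
hypothetical_product = (
    'https://groceries.aldi.co.uk/en-GB/p-example-product/0000000000000'
)

for pattern in (PRODUCT_BY_DIGITS, PRODUCT_BY_PREFIX):
    print(pattern)
    for url in crawled + [hypothetical_product]:
        print('  ', classify(url, pattern), url)
```

If a URL shaped like the hypothetical one is classified as 'product' here, the patterns themselves are probably not what stops `parse_products` from being called.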
I'm using this command to run the spider:
scrapy crawl aldi -O aldi.csv
And here's an extract from the logs:
2022-04-11 19:32:04 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: aldi)
2022-04-11 19:32:04 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.19044-SP0
2022-04-11 19:32:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'aldi',
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 0.75,
'FEED_EXPORT_FIELDS': ['prod_id', 'name', 'size', 'price', 'scrape_date'],
'NEWSPIDER_MODULE': 'aldiscraper.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['aldiscraper.spiders']}
2022-04-11 19:32:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-11 19:32:04 [scrapy.extensions.telnet] INFO: Telnet Password: 2072eb077b8dcc04
2022-04-11 19:32:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-04-11 19:32:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-11 19:32:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-11 19:32:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-11 19:32:06 [scrapy.core.engine] INFO: Spider opened
2022-04-11 19:32:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-11 19:32:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-11 19:32:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/robots.txt> (referer: None)
2022-04-11 19:32:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://groceries.aldi.co.uk/en-GB/> from <GET https://groceries.aldi.co.uk/>
2022-04-11 19:32:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/> (referer: None)
2022-04-11 19:32:09 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://groceries.aldi.co.uk/en-GB/#footer-collapse-0> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2022-04-11 19:32:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/forgot-password> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&clickedon=bread> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=shopall-bakery&clickedon=shopall-bakery> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&clickedon=bakery> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/milk-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=milk-alternatives&clickedon=milk-alternatives> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=shopall-vegan-drinks&clickedon=shopall-vegan-drinks> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&clickedon=vegan-drinks> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=vegan-sides-snacks&clickedon=vegan-sides-snacks> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-meat-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=meat-alternatives&clickedon=meat-alternatives> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=shopall-vegan-food&clickedon=shopall-vegan-food> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&clickedon=vegan-food> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&clickedon=vegan-plant-based> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect/Notify-Me> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Modern-Slavery-Act?origin=footer&c1=about-aldi&c2=modern-slavery-act&clickedon=modern-slavery-act> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=3> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:32:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=7> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=6> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=5> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=4> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol)
2022-04-11 19:32:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/milk-alternatives?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/milk-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=milk-alternatives&clickedon=milk-alternatives)
2022-04-11 19:33:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol)
2022-04-11 19:33:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=shopall-vegan-drinks&clickedon=shopall-vegan-drinks)
2022-04-11 19:33:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=vegan-sides-snacks&clickedon=vegan-sides-snacks)
2022-04-11 19:33:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-meat-alternatives?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-meat-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=meat-alternatives&clickedon=meat-alternatives)
2022-04-11 19:33:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=shopall-vegan-food&clickedon=shopall-vegan-food)
2022-04-11 19:33:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=vegan-sides-snacks&clickedon=vegan-sides-snacks)
2022-04-11 19:33:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?sortDirection=asc&page=3> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based)
2022-04-11 19:33:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based)
2022-04-11 19:33:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect> (referer: https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect/Notify-Me)
2022-04-11 19:33:06 [scrapy.extensions.logstats] INFO: Crawled 36 pages (at 36 pages/min), scraped 0 items (at 0 items/min)
2022-04-11 19:33:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Login> (referer: https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect/Notify-Me)
2022-04-11 19:33:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based)
2022-04-11 19:33:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol)
2022-04-11 19:33:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=3> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/easter/hot-cross-buns> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:33:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/About-Click--Collect?origin=footer&c1=about-aldi&c2=covid-19&clickedon=covid-19> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:33:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:33:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Privacy-Notice?origin=footer&c1=help&c2=privacy-notice&clickedon=privacy-notice> (referer: https://groceries.aldi.co.uk/en-GB/)
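Every URL in the extract above looks like a category, listing or information page. A rough way to confirm that no product-style URL was ever requested is to filter the "Crawled" lines of the full log against the product patterns from the rules. This is only a sketch: the function name `product_hits` is made up, `/\d{13}` is assumed to be equivalent to the thirteen-digit pattern in the rules, and the two sample lines are copied from the log.

```python
import re

# Matches Scrapy's "Crawled (status) <GET url>" debug lines.
CRAWLED = re.compile(r'DEBUG: Crawled \((\d{3})\) <GET ([^>\s]+)>')
# Product patterns from the two rule sets (thirteen digits, or the p- prefix).
PRODUCT_PATTERNS = (r'/\d{13}', r'en-GB/p-')


def product_hits(log_lines):
    """Return the crawled URLs that match any product pattern."""
    hits = []
    for line in log_lines:
        match = CRAWLED.search(line)
        if match is None:
            continue
        url = match.group(2)
        if any(re.search(pattern, url) for pattern in PRODUCT_PATTERNS):
            hits.append(url)
    return hits


sample = [
    '2022-04-11 19:32:09 [scrapy.core.engine] DEBUG: Crawled (200) '
    '<GET https://groceries.aldi.co.uk/en-GB/> (referer: None)',
    '2022-04-11 19:33:07 [scrapy.core.engine] DEBUG: Crawled (200) '
    '<GET https://groceries.aldi.co.uk/en-GB/Login> '
    '(referer: https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect/Notify-Me)',
]
print(product_hits(sample))
```

An empty result over the whole log would suggest that the link extractor never saw a product link in the downloaded pages, rather than that the callback ran and failed.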
And finally, the stats:
2022-04-11 19:49:55 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-11 19:49:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 5,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 5,
'downloader/request_bytes': 539252,
'downloader/request_count': 1095,
'downloader/request_method_count/GET': 1095,
'downloader/response_bytes': 56319721,
'downloader/response_count': 1095,
'downloader/response_status_count/200': 1087,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 3,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 455167,
'elapsed_time_seconds': 1068.912414,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 11, 18, 49, 55, 791572),
'httpcompression/response_bytes': 370815991,
'httpcompression/response_count': 1091,
'log_count/DEBUG': 1102,
'log_count/INFO': 27,
'request_depth_max': 4,
'response_received_count': 1091,
'robotstxt/forbidden': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1099,
'scheduler/dequeued/memory': 1099,
'scheduler/enqueued': 1099,
'scheduler/enqueued/memory': 1099,
'start_time': datetime.datetime(2022, 4, 11, 18, 32, 6, 879158)}
2022-04-11 19:49:55 [scrapy.core.engine] INFO: Spider closed (finished)
The only other output is a completely blank CSV file.
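The stats dump can also be read programmatically. The sketch below assumes a hand-copied subset of the dictionary above, and the helper name `summarise` is made up. Scrapy's core stats record an `item_scraped_count` key once items are scraped, so its absence from the dump is consistent with the callback never yielding an item.

```python
# Subset of the stats dump above, copied by hand.
stats = {
    'downloader/response_status_count/200': 1087,
    'dupefilter/filtered': 455167,
    'elapsed_time_seconds': 1068.912414,
    'request_depth_max': 4,
    'response_received_count': 1091,
    'robotstxt/forbidden': 5,
}


def summarise(stats):
    """Summarise pages fetched versus items scraped from a stats dict."""
    minutes = stats.get('elapsed_time_seconds', 0) / 60
    pages = stats.get('response_received_count', 0)
    return {
        # Key is missing from the dump when no item was scraped.
        'items': stats.get('item_scraped_count', 0),
        'pages': pages,
        'pages_per_minute': round(pages / minutes, 1) if minutes else 0.0,
    }


print(summarise(stats))
```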
I don't understand why it's scraping the pages but not returning any items. Thanks in advance for any help you can offer!
Thanks, Chris
2022-04-11 19:33:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol)
2022-04-11 19:33:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=3> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/easter/hot-cross-buns> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:33:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/About-Click--Collect?origin=footer&c1=about-aldi&c2=covid-19&clickedon=covid-19> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:33:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:33:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Privacy-Notice?origin=footer&c1=help&c2=privacy-notice&clickedon=privacy-notice> (referer: https://groceries.aldi.co.uk/en-GB/)
And finally, here are the stats:
2022-04-11 19:49:55 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-11 19:49:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 5,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 5,
'downloader/request_bytes': 539252,
'downloader/request_count': 1095,
'downloader/request_method_count/GET': 1095,
'downloader/response_bytes': 56319721,
'downloader/response_count': 1095,
'downloader/response_status_count/200': 1087,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 3,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 455167,
'elapsed_time_seconds': 1068.912414,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 11, 18, 49, 55, 791572),
'httpcompression/response_bytes': 370815991,
'httpcompression/response_count': 1091,
'log_count/DEBUG': 1102,
'log_count/INFO': 27,
'request_depth_max': 4,
'response_received_count': 1091,
'robotstxt/forbidden': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1099,
'scheduler/dequeued/memory': 1099,
'scheduler/enqueued': 1099,
'scheduler/enqueued/memory': 1099,
'start_time': datetime.datetime(2022, 4, 11, 18, 32, 6, 879158)}
2022-04-11 19:49:55 [scrapy.core.engine] INFO: Spider closed (finished)
The only other output is a completely blank CSV file.
I can't understand why it is scraping the pages but not returning any items. Thanks in advance for any help you can give me!
Thanks
Chris
Comments (1)
The page is populated dynamically by JavaScript, and CrawlSpider (that is, Scrapy) can't render JavaScript. So you either have to use an automation tool such as Selenium together with Scrapy, which is a bit complex, or, if the site has a hidden API, you can easily grab the data from that API. The data can be extracted from the API here, as the URL contains api.
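As a rough illustration of the API route, here is a minimal sketch of the parsing step only. The thread does not give the real endpoint or its response format, so the JSON shape and the key names (`results`, `sku`, `name`, `size`, `price`) are assumptions made up for this example; the output fields are the ones from the question's `FEED_EXPORT_FIELDS`.

```python
import json
from datetime import datetime

# Hypothetical payload: the real API's structure is not shown in the thread,
# so the "results" list and its keys are assumptions for illustration only.
SAMPLE_BODY = """
{"results": [
  {"sku": "0001", "name": "Example Loaf", "size": "800g", "price": "0.75"},
  {"sku": "0002", "name": "Example Oat Drink", "size": "1l"}
]}
"""


def parse_products(body, today=None):
    """Yield one item dict per product found in a JSON response body."""
    # Same date format as the question's spider; overridable for testing.
    today = today or datetime.now().strftime("%d/%m/%Y")
    data = json.loads(body)
    for product in data.get("results", []):
        # .get() returns None for a missing key instead of raising KeyError.
        yield {
            "prod_id": product.get("sku"),
            "name": product.get("name"),
            "size": product.get("size"),
            "price": product.get("price"),
            "scrape_date": today,
        }


if __name__ == "__main__":
    for item in parse_products(SAMPLE_BODY):
        print(item)
```

In a Scrapy spider, the same function could be called from a callback with `response.text` as the body, once the request points at the JSON endpoint found in the browser's network tab rather than at the HTML page. Because the data arrives as JSON, no CSS selectors or JavaScript rendering are involved.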