Scrapy now times out on a website it used to crawl fine
I'm using Scrapy to scrape a website: https://www.sephora.fr/marques/de-a-a-z/.
It worked well a year ago, but it now shows an error:
User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds
It retries 5 times and then fails completely. I can access the URL in Chrome, but it doesn't work in Scrapy. I've tried using custom user agents and emulating browser request headers, but it still doesn't work.
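For reference, one variant of the header emulation I tried looked roughly like this (a sketch: the header values below are illustrative, not the exact ones I used):

import scrapy

# Browser-like headers (illustrative values, not the exact ones I tried).
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/98.0.4758.102 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'fr-FR,fr;q=0.9,en;q=0.8',
}

# Build a request that carries those headers instead of Scrapy's defaults.
request = scrapy.Request(
    'https://www.sephora.fr/marques-de-a-a-z/',
    headers=BROWSER_HEADERS,
)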
Below is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import json
import requests
from urllib.parse import parse_qsl, urlencode
import re
from ..pipelines import Pipeline


class SephoraSpider(scrapy.Spider):
    """
    The SephoraSpider object gets you the data on all products hosted on sephora.fr.
    """
    name = 'sephora'
    allowed_domains = ['sephora.fr']
    # the url listing all the brands
    start_urls = ['https://www.sephora.fr/marques-de-a-a-z/']
    custom_settings = {
        'DOWNLOAD_TIMEOUT': '180',
    }

    def __init__(self):
        self.base_url = 'https://www.sephora.fr'

    def parse(self, response):
        """
        Parses the response of the first webpage we crawl.
        This method is launched automatically by Scrapy when crawling.

        :param response: the response from the webpage triggered by a GET request while crawling.
            A Response object represents an HTTP response, which is usually downloaded (by the Downloader)
            and fed to the Spiders for processing.
        :return: the results of parse_brand().
        :rtype: scrapy.Request
        """
        # if we are given the url of the brand we are interested in (burberry), we send an http request to it
        if response.url == "https://www.sephora.fr/marques/de-a-a-z/burberry-burb/":
            yield scrapy.Request(url=response.url, callback=self.parse_brand)
        # otherwise we are visiting another html object (another brand, a higher-level url, ...)
        # and we call the url back with another method
        else:
            self.log("parse: I just visited: " + response.url)
            urls = response.css('a.sub-category-link::attr(href)').extract()
            if urls:
                for url in urls:
                    yield scrapy.Request(url=self.base_url + url, callback=self.parse_brand)

    ...
Scrapy log:
(scr_env) [email protected]:~/environment/bass2/scraper (master) $ scrapy crawl sephora
2022-03-13 16:39:19 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: nosetime_scraper)
2022-03-13 16:39:19 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.6.9 (default, Dec 8 2021, 21:08:43) - [GCC 8.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.4.0-1068-aws-x86_64-with-Ubuntu-18.04-bionic
2022-03-13 16:39:19 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nosetime_scraper',
'CONCURRENT_REQUESTS': 1,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'DOWNLOAD_TIMEOUT': '180',
'EDITOR': '',
'NEWSPIDER_MODULE': 'nosetime_scraper.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['nosetime_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2022-03-13 16:39:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-13 16:39:19 [scrapy.extensions.telnet] INFO: Telnet Password: af81c5b648cc3542
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-13 16:39:19 [scrapy.core.engine] INFO: Spider opened
2022-03-13 16:39:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:39:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-13 16:40:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:41:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:42:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:42:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/robots.txt> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds..
2022-03-13 16:42:19 [py.warnings] WARNING: /home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/engine.py:276: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.download is deprecated
return self.download(result, spider) if isinstance(result, Request) else result
2022-03-13 16:43:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds..
2022-03-13 16:49:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:50:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:51:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:51:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:52:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:53:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:54:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:54:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 2 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:55:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:56:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:57:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:57:19 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 3 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:57:19 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sephora.fr/marques-de-a-a-z/>
Traceback (most recent call last):
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
cast(Failure, result).throwExceptionIntoGenerator, gen
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 62, in run
return f(*args, **kwargs)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 858, in _runCallbacks
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:57:19 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-13 16:57:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6,
'downloader/request_bytes': 1881,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 1080.231435,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 13, 16, 57, 19, 904633),
'log_count/DEBUG': 5,
'log_count/ERROR': 4,
'log_count/INFO': 28,
'log_count/WARNING': 1,
'memusage/max': 72749056,
'memusage/startup': 70950912,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.internet.error.TimeoutError': 4,
"robotstxt/exception_count/<class 'twisted.internet.error.TimeoutError'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 3, 13, 16, 39, 19, 673198)}
2022-03-13 16:57:19 [scrapy.core.engine] INFO: Spider closed (finished)
I am going to look at the request headers using Fiddler and do some tests. Maybe Scrapy sends a Connection: close header by default, and that's why I'm not getting any response from the Sephora site?
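If Fiddler confirms that, one way to test the hypothesis would be to override the header in settings.py, something like this (untested, and I'm not sure Scrapy's downloader actually honors a user-supplied Connection header):

# Hypothetical test: send browser-like defaults and force keep-alive.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'fr-FR,fr;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
}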
Here are the logs when I chose not to respect robots.txt:
(scr_env) [email protected]:~/environment/bass2/scraper (master) $ scrapy crawl sephora
2022-03-13 23:23:38 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: nosetime_scraper)
2022-03-13 23:23:38 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.6.9 (default, Dec 8 2021, 21:08:43) - [GCC 8.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.4.0-1068-aws-x86_64-with-Ubuntu-18.04-bionic
2022-03-13 23:23:38 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nosetime_scraper',
'CONCURRENT_REQUESTS': 1,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'DOWNLOAD_TIMEOUT': '180',
'EDITOR': '',
'NEWSPIDER_MODULE': 'nosetime_scraper.spiders',
'SPIDER_MODULES': ['nosetime_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2022-03-13 23:23:38 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-13 23:23:38 [scrapy.extensions.telnet] INFO: Telnet Password: 3f4205a34aff02c5
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-13 23:23:38 [scrapy.core.engine] INFO: Spider opened
2022-03-13 23:23:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:23:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-13 23:24:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:25:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:26:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:26:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:27:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:28:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:29:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:29:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 2 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:30:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:31:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:32:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:32:38 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 3 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:32:38 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sephora.fr/marques-de-a-a-z/>
Traceback (most recent call last):
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
cast(Failure, result).throwExceptionIntoGenerator, gen
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 62, in run
return f(*args, **kwargs)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 858, in _runCallbacks
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:32:39 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-13 23:32:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 3,
'downloader/request_bytes': 951,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'elapsed_time_seconds': 540.224149,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 13, 23, 32, 39, 59500),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 19,
'memusage/max': 72196096,
'memusage/startup': 70766592,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.TimeoutError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 3, 13, 23, 23, 38, 835351)}
2022-03-13 23:32:39 [scrapy.core.engine] INFO: Spider closed (finished)
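Another check worth trying: whether a plain Python HTTP client can reach the site from the same AWS box at all. A minimal sketch (if this also hangs, the block is presumably at the network/IP level rather than anything Scrapy-specific):

import requests

# Same browser-like user agent the spider uses.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
                  'Safari/537.36',
}
resp = requests.get('https://www.sephora.fr/marques-de-a-a-z/',
                    headers=headers, timeout=30)
print(resp.status_code, len(resp.text))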
And here is my environment (pip list output):
(scr_env) C:\Users\antoi\Documents\Programming\Work\scrapy-scraper>pip list
Package Version
------------------- ------------------
async-generator 1.10
attrs 21.4.0
Automat 20.2.0
beautifulsoup4 4.10.0
blis 0.7.5
bs4 0.0.1
catalogue 2.0.6
certifi 2021.10.8
cffi 1.15.0
charset-normalizer 2.0.12
click 8.0.4
colorama 0.4.4
configparser 5.2.0
constantly 15.1.0
crayons 0.4.0
cryptography 36.0.1
cssselect 1.1.0
cymem 2.0.6
DAWG-Python 0.7.2
docopt 0.6.2
en-core-web-sm 3.2.0
et-xmlfile 1.1.0
geographiclib 1.52
geopy 2.2.0
h11 0.13.0
h2 3.2.0
hpack 3.0.0
hyperframe 5.2.0
hyperlink 21.0.0
idna 3.3
incremental 21.3.0
itemadapter 0.4.0
itemloaders 1.0.4
Jinja2 3.0.3
jmespath 0.10.0
langcodes 3.3.0
libretranslatepy 2.1.1
lxml 4.8.0
MarkupSafe 2.1.0
murmurhash 1.0.6
numpy 1.22.2
openpyxl 3.0.9
outcome 1.1.0
packaging 21.3
pandas 1.4.1
parsel 1.6.0
pathy 0.6.1
pip 22.0.4
preshed 3.0.6
priority 1.3.0
Protego 0.2.1
pyaes 1.6.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
pydantic 1.8.2
PyDispatcher 2.0.5
pymongo 3.11.0
pymorphy2 0.9.1
pymorphy2-dicts-ru 2.4.417127.4579844
pyOpenSSL 22.0.0
pyparsing 3.0.7
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2021.3
queuelib 1.6.2
requests 2.27.1
rsa 4.8
ru-core-news-md 3.2.0
Scrapy 2.5.1
selenium 4.1.2
service-identity 21.1.0
setuptools 56.0.0
six 1.16.0
smart-open 5.2.1
sniffio 1.2.0
sortedcontainers 2.4.0
soupsieve 2.3.1
spacy 3.2.2
spacy-legacy 3.0.9
spacy-loggers 1.0.1
srsly 2.4.2
Telethon 1.24.0
thinc 8.0.13
tqdm 4.62.3
translate 3.6.1
trio 0.20.0
trio-websocket 0.9.2
Twisted 22.1.0
twisted-iocpsupport 1.0.2
typer 0.4.0
typing_extensions 4.1.1
urllib3 1.26.8
w3lib 1.22.0
wasabi 0.9.0
webdriver-manager 3.5.3
wsproto 1.0.0
zope.interface 5.4.0
With scrapy runspider sephora.py I notice that it doesn't accept my relative import from ..pipelines import Pipeline:
(scr_env) C:\Users\antoi\Documents\Programming\Work\scrapy-scraper\nosetime_scraper\spiders>scrapy runspider sephora.py
2022-03-14 01:00:27 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: nosetime_scraper)
2022-03-14 01:00:27 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.
9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform
Windows-10-10.0.19043-SP0
2022-03-14 01:00:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
Usage
=====
scrapy runspider [options] <spider_file>
runspider: error: Unable to load 'sephora.py': attempted relative import with no known parent package
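(Presumably that happens because runspider loads the file as a standalone script, so the relative import has no parent package to resolve against. Running scrapy crawl sephora from the project root, or switching to an absolute import, should sidestep it, though it's a separate issue from the timeout:)

# instead of: from ..pipelines import Pipeline
from nosetime_scraper.pipelines import Pipeline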
Here is my settings.py:
# Scrapy settings for nosetime_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'nosetime_scraper'
SPIDER_MODULES = ['nosetime_scraper.spiders']
NEWSPIDER_MODULE = 'nosetime_scraper.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 7
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'nosetime_scraper.middlewares.NosetimeScraperSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'nosetime_scraper.middlewares.NosetimeScraperDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'nosetime_scraper.pipelines.NosetimeScraperPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'