Scrapy crawler with Celery not running inside a Docker container
I wrote a Scrapy CrawlerProcess that is run from a script. It also uses Celery + RabbitMQ to control which URLs get scraped. The update.py script sends the URLs to RabbitMQ, and a Celery worker runs the Scrapy crawl.
When I debug it in my IDE, it runs successfully. However, when I try to run it inside a Docker container, the crawl never starts. I have already double-checked the network settings in docker-compose.yaml and everything looks correct. I've been working on and debugging this for days without any different result.
update.py
...
for key in urls.keys():
    site_name = key
    links = urls[key]
    logger.info(f'Insert into queue: {key} | {len(links)} records')
    crawler_task.delay({'site_name': site_name, 'links': links})
...
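For reference, urls is a dict keyed by site name, and each value is a list of link records; the shape the worker expects looks roughly like this (the site name, listing_id values and links below are made up):

urls = {
    'site_a': [
        {'listing_id': 1, 'link': 'https://site-a.example/listing/1'},
        {'listing_id': 2, 'link': 'https://site-a.example/listing/2'},
    ],
}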
app.py (Celery worker and setup)
import logging
import os

from billiard.context import Process
from celery import Celery
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor

from enums import sites

logger = logging.getLogger(__name__)


def get_broker():
    rabbitmq_user = os.getenv('RABBITMQ_USER')
    rabbitmq_password = os.getenv('RABBITMQ_PASSWORD')
    rabbitmq_host = os.getenv('RABBITMQ_HOST')
    rabbitmq_port = os.getenv('RABBITMQ_PORT')
    return f'amqp://{rabbitmq_user}:{rabbitmq_password}@{rabbitmq_host}:{rabbitmq_port}'


app = Celery('app', broker=get_broker())


class CrawlerScript(Process):
    def __init__(self, params):
        Process.__init__(self)
        settings = Settings()
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'crawler.settings'
        settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
        settings.setmodule(settings_module_path)
        links = params.get('links')
        site_name = params.get('site_name')
        site = sites.get(site_name)
        if site:
            self.spider = site.get('spider')
            spider_settings = site.get('settings')
            meta = site.get('meta')
            if spider_settings:
                settings.setdict(spider_settings)
            self.spider.urls = [{'listing_id': link.get('listing_id'), 'link': link.get('link')} for link in links]
            self.spider.meta = meta
            self.crawler = Crawler(self.spider, settings)
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        else:
            logger.error(f'No site found: {site_name}')

    def run(self):
        self.crawler.crawl(self.spider())
        reactor.run()


@app.task(soft_time_limit=30, time_limit=60)
def crawler_task(params):
    crawler = CrawlerScript(params)
    crawler.start()
    crawler.join()


if __name__ == '__main__':
    app.start()
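The sites mapping imported from enums isn't shown above; it associates each site name with a spider class and optional per-site settings and meta. A rough sketch of its shape, with a hypothetical spider class and values (SiteASpider, its module path and the settings below are placeholders, not real project code):

enums/sites.py (illustrative only)
from crawler.spiders.site_a import SiteASpider  # hypothetical spider module

sites = {
    'site_a': {
        'spider': SiteASpider,                      # spider class, instantiated in CrawlerScript.run()
        'settings': {'DOWNLOAD_DELAY': 1},          # merged into the Scrapy settings via settings.setdict()
        'meta': {'handle_httpstatus_list': [404]},  # attached to every Request as meta
    },
}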
base.py (generic base class for the spiders)
import scrapy

from crawler.items import Item
from enums.listing_status import ListingStatusEnum


class BaseSpider(scrapy.Spider):
    download_timeout = 30

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)

    def start_requests(self):
        for url in self.urls:
            print(f"START REQUESTS: {url.get('link')}")  # This gets printed while running in the IDE
            yield scrapy.Request(url.get('link'),
                                 callback=self.parse,
                                 errback=self.errback,
                                 cb_kwargs=dict(listing_id=url.get('listing_id')),
                                 meta=self.meta,
                                 dont_filter=True)

    def errback(self, failure):
        self.logger.error('ERROR CALLBACK', repr(failure))
        listing_id = failure.request.cb_kwargs['listing_id']
        status = ListingStatusEnum.URL_NOT_FOUND.value
        yield Item(listing_id=listing_id, name=None, price=None, status=status)
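Concrete spiders subclass BaseSpider and only implement parse; a minimal, hypothetical example (SiteASpider, its module path and the CSS selectors are made up):

crawler/spiders/site_a.py (illustrative only)
from crawler.items import Item
from crawler.spiders.base import BaseSpider  # assumed location of BaseSpider


class SiteASpider(BaseSpider):
    name = 'site_a'

    def parse(self, response, listing_id):
        # listing_id arrives through cb_kwargs set in BaseSpider.start_requests
        yield Item(listing_id=listing_id,
                   name=response.css('h1::text').get(),
                   price=response.css('.price::text').get(),
                   status=None)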
docker-compose.yaml
version: "3.9"
services:
rabbitmq:
image: rabbitmq:3.6.16-management-alpine
container_name: "rabbitmq"
restart: unless-stopped
environment:
RABBITMQ_DEFAULT_USER: "${RABBITMQ_USER}"
RABBITMQ_DEFAULT_PASS: "${RABBITMQ_PASSWORD}"
ports:
- "${RABBITMQ_PORT}:${RABBITMQ_PORT}"
- "1${RABBITMQ_PORT}:1${RABBITMQ_PORT}"
volumes:
- ./.docker/rabbitmq/data/:/var/lib/rabbitmq/
- ./.docker/rabbitmq/log/:/var/log/rabbitmq
deploy:
resources:
limits:
cpus: "1"
memory: 1G
reservations:
memory: 512M
crawler:
build:
context: .
dockerfile: Dockerfile
container_name: crawler
command: bash -c "celery -A app worker --pool=threads --loglevel=INFO --concurrency=1 -n worker@%n"
restart: unless-stopped
environment:
RABBITMQ_USER: "${RABBITMQ_USER}"
RABBITMQ_PASSWORD: "${RABBITMQ_PASSWORD}"
RABBITMQ_HOST: "${RABBITMQ_HOST}"
RABBITMQ_PORT: "${RABBITMQ_PORT}"
network_mode: host
depends_on:
- rabbitmq
deploy:
resources:
limits:
cpus: "4"
memory: 3G
reservations:
memory: 512M
logging:
options:
max-size: "1G"
max-file: "30"
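The compose file takes its credentials and broker address from environment variables; a hypothetical .env with example values (the real values differ; RABBITMQ_HOST is whatever address the crawler can reach the broker on, given that it uses network_mode: host):

.env (example values only)
RABBITMQ_USER=guest
RABBITMQ_PASSWORD=guest
RABBITMQ_HOST=localhost
RABBITMQ_PORT=5672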
When running in the IDE, the crawl starts normally.
Output when running in the IDE:
[2022-04-07 08:53:54,330: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] received
[2022-04-07 08:54:04,732: INFO/MainProcess] Overridden settings:
{'BOT_NAME': 'CRAWLER',
'LOG_ENABLED': False,
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'crawler.spiders',
'SPIDER_MODULES': ['crawler.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
[2022-04-07 08:54:04,762: INFO/MainProcess] Telnet Password: 5e1278a7d809124a
[2022-04-07 08:54:04,794: INFO/MainProcess] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
[2022-04-07 08:54:15,748: INFO/CrawlerScript-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2022-04-07 08:54:15,757: INFO/CrawlerScript-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Enabled item pipelines:
['crawler.pipelines.PricingCrawlerPipeline']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Spider opened
[2022-04-07 08:54:16,996: INFO/CrawlerScript-1] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[2022-04-07 08:54:17,000: INFO/CrawlerScript-1] Telnet console listening on 127.0.0.1:6023
**The script is running - START REQUESTS log from base.py**
[2022-04-07 08:54:18,831: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,872: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,878: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,884: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,889: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,894: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
...
[2022-04-07 08:54:26,029: INFO/CrawlerScript-1] Closing spider (finished)
[2022-04-07 08:54:26,037: INFO/CrawlerScript-1] Dumping Scrapy stats:
{'downloader/request_bytes': 6401,
'downloader/request_count': 15,
'downloader/request_method_count/GET': 15,
'downloader/response_bytes': 1443579,
'downloader/response_count': 15,
'downloader/response_status_count/200': 14,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 9.036862,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 7, 11, 54, 26, 33066),
'httpcompression/response_bytes': 7165556,
'httpcompression/response_count': 14,
'item_scraped_count': 14,
'log_count/INFO': 10,
'log_count/WARNING': 14,
'memusage/max': 96260096,
'memusage/startup': 96260096,
'response_received_count': 14,
'scheduler/dequeued': 15,
'scheduler/dequeued/memory': 15,
'scheduler/enqueued': 15,
'scheduler/enqueued/memory': 15,
'start_time': datetime.datetime(2022, 4, 7, 11, 54, 16, 996204)}
[2022-04-07 08:54:26,038: INFO/CrawlerScript-1] Spider closed (finished)
[2022-04-07 08:54:26,070: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] succeeded in 31.737013949023094s: None
When running in Docker, the crawl doesn't start.
Output when running in the Docker container:
crawler | [2022-04-07 12:18:33,000: INFO/MainProcess] Task app.crawler_task[2d05036b-ae92-488b-b7de-a6213905af48] received
crawler | [2022-04-07 12:18:33,009: INFO/MainProcess] Overridden settings:
crawler | {'BOT_NAME': 'CRAWLER',
crawler | 'LOG_ENABLED': False,
crawler | 'LOG_LEVEL': 'INFO',
crawler | 'NEWSPIDER_MODULE': 'crawler.spiders',
crawler | 'SPIDER_MODULES': ['crawler.spiders'],
crawler | 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
crawler | 'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
crawler | [2022-04-07 12:18:33,027: INFO/MainProcess] Telnet Password: f64b27b6d4457920
crawler | [2022-04-07 12:18:33,081: INFO/MainProcess] Enabled extensions:
crawler | ['scrapy.extensions.corestats.CoreStats',
crawler | 'scrapy.extensions.telnet.TelnetConsole',
crawler | 'scrapy.extensions.memusage.MemoryUsage',
crawler | 'scrapy.extensions.logstats.LogStats']
crawler | [2022-04-07 12:18:33,159: INFO/CrawlerScript-1] Enabled downloader middlewares:
crawler | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
crawler | 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
crawler | 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
crawler | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
crawler | 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
crawler | 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
crawler | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
crawler | 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
crawler | 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
crawler | 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
crawler | 'scrapy.downloadermiddlewares.stats.DownloaderStats']
crawler | [2022-04-07 12:18:33,164: INFO/CrawlerScript-1] Enabled spider middlewares:
crawler | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
crawler | 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
crawler | 'scrapy.spidermiddlewares.referer.RefererMiddleware',
crawler | 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
crawler | 'scrapy.spidermiddlewares.depth.DepthMiddleware']
And nothing more happens.
I'm running on Python 3.8 with the following dependencies:
requirements.txt
scrapy==2.6.1
celery==5.2.3
billiard==3.6.4.0
Could it be something related to the Twisted reactor inside Docker? Any ideas why the crawl never starts in the container?
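A standalone run inside the container, without Celery or billiard, should show whether Scrapy itself ever reaches start_requests there; a minimal diagnostic sketch (the spider import path and test URL are placeholders):

standalone_check.py (diagnostic sketch only)
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from crawler.spiders.site_a import SiteASpider  # hypothetical spider, one of those registered in enums.sites

# same settings module the worker uses
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'crawler.settings')

SiteASpider.urls = [{'listing_id': 1, 'link': 'https://example.com/'}]
SiteASpider.meta = {}

process = CrawlerProcess(get_project_settings())
process.crawl(SiteASpider)
process.start()  # blocks until the crawl finishes; "START REQUESTS" should be printed if requests are scheduled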