Scrapy with Celery doesn't run inside a Docker container

Posted on 2025-01-19 13:53:41


I wrote a Scrapy CrawlerProcess to run from a script. It also uses Celery + RabbitMQ to control which URLs get scraped.

The update.py script sends the URLs to RabbitMQ, and a Celery worker runs the Scrapy crawl.

When debugging in my IDE, it runs successfully. However, when I try to run it inside a Docker container, the script doesn't run.

I have already double-checked the network settings in docker-compose.yaml and everything looks correct. I've been working on and debugging this for days without any different result.

update.py

...
for key in urls.keys():
    site_name = key
    links = urls[key]
    logger.info(f'Insert into queue: {key} | {len(links)} records')
    crawler_task.delay({'site_name': site_name, 'links': links})
...
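
For context, urls is a dict keyed by site name, where each value is a list of link records carrying a listing_id and a link; an illustrative payload (made-up values) would look like:

# Illustrative payload only; the real values come from elsewhere in update.py.
urls = {
    'site_a': [
        {'listing_id': 101, 'link': 'https://example.com/listing/101'},
        {'listing_id': 102, 'link': 'https://example.com/listing/102'},
    ],
}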

app.py (Celery worker and setup)

import logging
import os

from billiard.context import Process
from celery import Celery
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor

from enums import sites

logger = logging.getLogger(__name__)


def get_broker():
    rabbitmq_user = os.getenv('RABBITMQ_USER')
    rabbitmq_password = os.getenv('RABBITMQ_PASSWORD')
    rabbitmq_host = os.getenv('RABBITMQ_HOST')
    rabbitmq_port = os.getenv('RABBITMQ_PORT')

    return f'amqp://{rabbitmq_user}:{rabbitmq_password}@{rabbitmq_host}:{rabbitmq_port}'


app = Celery('app', broker=get_broker())


class CrawlerScript(Process):
    def __init__(self, params):
        Process.__init__(self)
        settings = Settings()

        os.environ['SCRAPY_SETTINGS_MODULE'] = 'crawler.settings'
        settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
        settings.setmodule(settings_module_path)

        links = params.get('links')
        site_name = params.get('site_name')

        site = sites.get(site_name)

        if site:
            self.spider = site.get('spider')
            spider_settings = site.get('settings')
            meta = site.get('meta')

            if spider_settings:
                settings.setdict(spider_settings)

            self.spider.urls = [{'listing_id': link.get('listing_id'), 'link': link.get('link')} for link in links]
            self.spider.meta = meta

            self.crawler = Crawler(self.spider, settings)
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

        else:
            logger.error(f'No site found: {site_name}')

    def run(self):
        self.crawler.crawl(self.spider())
        reactor.run()


@app.task(soft_time_limit=30, time_limit=60)
def crawler_task(params):
    crawler = CrawlerScript(params)
    crawler.start()
    crawler.join()


if __name__ == '__main__':
    app.start()
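
As a side note, the task can also be exercised without going through RabbitMQ by running it eagerly in the same process, which is a quick way to sanity-check the Celery/Scrapy side inside the container (the payload below is just a placeholder):

# Debugging sketch: run the task eagerly, bypassing the broker.
# The payload values are placeholders, not real data.
from app import crawler_task

result = crawler_task.apply(args=[{
    'site_name': 'site_a',
    'links': [{'listing_id': 1, 'link': 'https://example.com'}],
}])
print(result.status)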

base.py (generic base class for the spiders)

import scrapy

from crawler.items import Item
from enums.listing_status import ListingStatusEnum


class BaseSpider(scrapy.Spider):
    download_timeout = 30

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)

    def start_requests(self):
        for url in self.urls:
            print(f"START REQUESTS: {url.get('link')}") # This get printed while running on IDE
            yield scrapy.Request(url.get('link'),
                                 callback=self.parse,
                                 errback=self.errback,
                                 cb_kwargs=dict(listing_id=url.get('listing_id')),
                                 meta=self.meta,
                                 dont_filter=True)

    def errback(self, failure):
        self.logger.error('ERROR CALLBACK: %s', repr(failure))
        listing_id = failure.request.cb_kwargs['listing_id']
        status = ListingStatusEnum.URL_NOT_FOUND.value

        yield Item(listing_id=listing_id, name=None, price=None, status=status)
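
The concrete spiders (the ones referenced through the sites dict in enums) are omitted here; a minimal subclass, with hypothetical names and placeholder selectors, looks roughly like this:

from crawler.items import Item
from crawler.spiders.base import BaseSpider  # hypothetical module path


class SiteASpider(BaseSpider):
    name = 'site_a'  # hypothetical spider name

    def parse(self, response, listing_id=None):
        # Placeholder selectors; the real spiders extract site-specific fields.
        yield Item(listing_id=listing_id,
                   name=response.css('h1::text').get(),
                   price=response.css('.price::text').get(),
                   status=None)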

docker-compose.yaml

version: "3.9"
services:
  rabbitmq:
    image: rabbitmq:3.6.16-management-alpine
    container_name: "rabbitmq"
    restart: unless-stopped
    environment:
      RABBITMQ_DEFAULT_USER: "${RABBITMQ_USER}"
      RABBITMQ_DEFAULT_PASS: "${RABBITMQ_PASSWORD}"
    ports:
      - "${RABBITMQ_PORT}:${RABBITMQ_PORT}"
      - "1${RABBITMQ_PORT}:1${RABBITMQ_PORT}"
    volumes:
      - ./.docker/rabbitmq/data/:/var/lib/rabbitmq/
      - ./.docker/rabbitmq/log/:/var/log/rabbitmq
    deploy:
      resources:
        limits:
          cpus: "1"
          memory: 1G
        reservations:
          memory: 512M

  crawler:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: crawler
    command: bash -c "celery -A app worker --pool=threads --loglevel=INFO  --concurrency=1 -n worker@%n"
    restart: unless-stopped
    environment:
      RABBITMQ_USER: "${RABBITMQ_USER}"
      RABBITMQ_PASSWORD: "${RABBITMQ_PASSWORD}"
      RABBITMQ_HOST: "${RABBITMQ_HOST}"
      RABBITMQ_PORT: "${RABBITMQ_PORT}"
    network_mode: host
    depends_on:
      - rabbitmq
    deploy:
      resources:
        limits:
          cpus: "4"
          memory: 3G
        reservations:
          memory: 512M
    logging:
      options:
        max-size: "1G"
        max-file: "30"

When running in the IDE, the script starts normally.

Output when running in the IDE:

[2022-04-07 08:53:54,330: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] received
[2022-04-07 08:54:04,732: INFO/MainProcess] Overridden settings:
{'BOT_NAME': 'CRAWLER',
 'LOG_ENABLED': False,
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'SPIDER_MODULES': ['crawler.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
[2022-04-07 08:54:04,762: INFO/MainProcess] Telnet Password: 5e1278a7d809124a
[2022-04-07 08:54:04,794: INFO/MainProcess] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
[2022-04-07 08:54:15,748: INFO/CrawlerScript-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2022-04-07 08:54:15,757: INFO/CrawlerScript-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Enabled item pipelines:
['crawler.pipelines.PricingCrawlerPipeline']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Spider opened
[2022-04-07 08:54:16,996: INFO/CrawlerScript-1] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[2022-04-07 08:54:17,000: INFO/CrawlerScript-1] Telnet console listening on 127.0.0.1:6023

**The script is running - START REQUESTS log from base.py**
[2022-04-07 08:54:18,831: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,872: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,878: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,884: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,889: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,894: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
...
[2022-04-07 08:54:26,029: INFO/CrawlerScript-1] Closing spider (finished)
[2022-04-07 08:54:26,037: INFO/CrawlerScript-1] Dumping Scrapy stats:
{'downloader/request_bytes': 6401,
 'downloader/request_count': 15,
 'downloader/request_method_count/GET': 15,
 'downloader/response_bytes': 1443579,
 'downloader/response_count': 15,
 'downloader/response_status_count/200': 14,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 9.036862,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 7, 11, 54, 26, 33066),
 'httpcompression/response_bytes': 7165556,
 'httpcompression/response_count': 14,
 'item_scraped_count': 14,
 'log_count/INFO': 10,
 'log_count/WARNING': 14,
 'memusage/max': 96260096,
 'memusage/startup': 96260096,
 'response_received_count': 14,
 'scheduler/dequeued': 15,
 'scheduler/dequeued/memory': 15,
 'scheduler/enqueued': 15,
 'scheduler/enqueued/memory': 15,
 'start_time': datetime.datetime(2022, 4, 7, 11, 54, 16, 996204)}
[2022-04-07 08:54:26,038: INFO/CrawlerScript-1] Spider closed (finished)
[2022-04-07 08:54:26,070: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] succeeded in 31.737013949023094s: None

When running in Docker, the script doesn't start.

Output when running in the Docker container:

crawler    | [2022-04-07 12:18:33,000: INFO/MainProcess] Task app.crawler_task[2d05036b-ae92-488b-b7de-a6213905af48] received
crawler    | [2022-04-07 12:18:33,009: INFO/MainProcess] Overridden settings:
crawler    | {'BOT_NAME': 'CRAWLER',
crawler    |  'LOG_ENABLED': False,
crawler    |  'LOG_LEVEL': 'INFO',
crawler    |  'NEWSPIDER_MODULE': 'crawler.spiders',
crawler    |  'SPIDER_MODULES': ['crawler.spiders'],
crawler    |  'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
crawler    |                'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
crawler    | [2022-04-07 12:18:33,027: INFO/MainProcess] Telnet Password: f64b27b6d4457920
crawler    | [2022-04-07 12:18:33,081: INFO/MainProcess] Enabled extensions:
crawler    | ['scrapy.extensions.corestats.CoreStats',
crawler    |  'scrapy.extensions.telnet.TelnetConsole',
crawler    |  'scrapy.extensions.memusage.MemoryUsage',
crawler    |  'scrapy.extensions.logstats.LogStats']
crawler    | [2022-04-07 12:18:33,159: INFO/CrawlerScript-1] Enabled downloader middlewares:
crawler    | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
crawler    |  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
crawler    |  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
crawler    |  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
crawler    |  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
crawler    |  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
crawler    |  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
crawler    |  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
crawler    |  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
crawler    |  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
crawler    |  'scrapy.downloadermiddlewares.stats.DownloaderStats']
crawler    | [2022-04-07 12:18:33,164: INFO/CrawlerScript-1] Enabled spider middlewares:
crawler    | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
crawler    |  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
crawler    |  'scrapy.spidermiddlewares.referer.RefererMiddleware',
crawler    |  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
crawler    |  'scrapy.spidermiddlewares.depth.DepthMiddleware']

And nothing more happens.

I'm running on Python 3.8 with the following dependencies:

requirements.txt

scrapy==2.6.1
celery==5.2.3
billiard==3.6.4.0

Could it be something related to the Twisted reactor when running in Docker? Any ideas why the script doesn't start in the container?
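
If it does turn out to be reactor-related, one variant I'm considering is letting Scrapy manage the reactor itself via CrawlerProcess inside the child process, instead of wiring up Crawler and reactor.run() by hand. An untested sketch of that run() variant is below; it assumes __init__ also stores the Settings object as self.settings, and I don't know yet whether it changes the behaviour in Docker:

from scrapy.crawler import CrawlerProcess

# Untested sketch: a drop-in replacement for CrawlerScript.run() where
# CrawlerProcess owns the reactor, instead of manual Crawler + reactor.run().
def run(self):
    process = CrawlerProcess(self.settings)
    process.crawl(self.spider)   # self.spider is the spider class, as in the current code
    process.start()              # starts the reactor and blocks until the crawl finishes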
