How do I build my own middleware in Scrapy?

Asked 2025-02-09 09:57:55


I'm just starting to learn Scrapy and I have a question. For my "spider" I have to take a list of URLs (start_urls) from a Google Sheets spreadsheet, and I have this code:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# authorize against the Sheets/Drive APIs with a service-account key file
scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)


client = gspread.authorize(creds)

# read every value in column 2 of the first worksheet of 'Sheet_1'
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)

for link in records_data:
    print(link)
    ........

How do I configure a middleware so that, when the spider is launched (scrapy crawl my_spider), the links produced by this code are automatically substituted into start_urls? Perhaps I need to create a class in middlewares.py?
I would be grateful for any help, with examples.
This rule needs to apply to all new spiders; generating the list from a file in start_requests (for example start_urls = [l.strip() for l in open('urls.txt').readlines()]) is not convenient...


Answers (1)

Answered by 沙沙粒小, 2025-02-16 09:57:55


Read this

spider.py:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    # enable the middleware for this spider only; register it in the project's
    # settings.py instead if it should apply to every spider
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'tempbuffer.middlewares.ExampleMiddleware': 543,
        }
    }

    def parse(self, response):
        print(response.url)

middlewares.py:

import scrapy


class ExampleMiddleware(object):
    def process_start_requests(self, start_requests, spider):
        # ignore the spider's own start requests and instead yield one
        # request per line of urls.txt -- change this to your needs
        with open('urls.txt', 'r') as f:
            for url in f:
                yield scrapy.Request(url=url.strip())

urls.txt:

https://example.com
https://example1.com
https://example2.org

output:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example2.org> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example1.com> (referer: None)
https://example2.org
https://example.com
https://example1.com
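
To take the start URLs from the Google Sheet instead of urls.txt, the same process_start_requests hook can wrap the gspread code from the question. The sketch below is not part of the original answer; it assumes, as in the question, a service-account key in token.json, a spreadsheet named Sheet_1, and the URLs in column 2 of the first worksheet, and the class name GoogleSheetStartUrlsMiddleware is made up for illustration:

middlewares.py:

import gspread
import scrapy
from oauth2client.service_account import ServiceAccountCredentials


class GoogleSheetStartUrlsMiddleware(object):
    # hypothetical middleware: builds the start requests from a Google Sheet
    def process_start_requests(self, start_requests, spider):
        scope = ['https://spreadsheets.google.com/feeds',
                 'https://www.googleapis.com/auth/drive']
        creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
        client = gspread.authorize(creds)

        # column 2 of the first worksheet of 'Sheet_1', exactly as in the question
        records_data = client.open('Sheet_1').get_worksheet(0).col_values(2)

        for url in records_data:
            if url.strip():
                yield scrapy.Request(url=url.strip())

And since the rule should apply to all new spiders, the middleware can be registered once in the project-wide settings.py instead of per-spider custom_settings, for example:

settings.py:

SPIDER_MIDDLEWARES = {
    'tempbuffer.middlewares.GoogleSheetStartUrlsMiddleware': 543,
}

(Here tempbuffer is just the project name used in the answer; replace it with your own project module.)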