Script stopped working after adding a kwarg and a second parser

Posted on 2025-02-10 07:12:46


After adding the kwarg (the cb_kwargs argument), the script stopped outputting any scraped data; it only printed the normal spider debug output. I have no idea why it does that.
It looks like the whole parseMain callback is just sitting there doing nothing.

Here is my code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, Spider


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1']
    def parse(self, response):
        websites = response.css('div.root')
        for websitep in websites:
            websiteurl = websitep.css('div.rp-l0pkv6 a::attr(href)').get()
            href = websitep.css('li.rp-np9kb1 a::attr(href)').get()
            url = response.urljoin(href)
            yield Request(url, cb_kwargs={'websiteurl': websiteurl}, callback=self.parseMain)
    
    def parseMain(self, response, websiteurl):
   # def parse(self, response):
        for quote in response.css('.rp-y89gny.eboilu01 ul li'):
                address = quote.css('address.rp-o9b83y::text').get(),
                name = quote.css('h2.rp-69f2r4::text').get(),
                href = quote.css('li.rp-np9kb1 a::attr(href)').get(),
                PAGETEST = response.css('a.rp-mmikj9::attr(href)').get()
        yield {
            'address' : address,
            'name' : name,
            'href' : href,
            'PAGETEST' : PAGETEST,
            'websiteurl' : websiteurl
            }
        next_page=response.css('a.rp-mmikj9::attr(href)').get()
        if next_page is not None:
            next_page_link=response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback= self.parse)
    
if __name__ == "__main__":
    process =CrawlerProcess()
    process.crawl(RynekMainSpider)
    process.start()
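
Given the symptom (only debug output, no items), a quick check is whether the container selector in parse matches anything at all; if the count is zero, parse yields no requests and parseMain is never scheduled. A minimal debugging sketch (the spider name is made up; it is not part of the original post):

import scrapy


class SelectorCheckSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the selector check.
    name = "selector_check"
    start_urls = ['https://rynekpierwotny.pl/deweloperzy/?page=1']

    def parse(self, response):
        websites = response.css('div.root')
        # If this logs 0, the selector never matches, no follow-up
        # requests are yielded, and the second callback never runs.
        self.logger.info("matched %d container nodes", len(websites))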

Thanks in advance for any help.
EDIT: Oh shoot, I forgot to say what my code is supposed to do.
Basically, parse gets the website URL from inside subpages like "https://rynekpierwotny.pl/deweloperzy/dom-development-sa-955/",
while parseMain gets all the data (like address and name) from the main page "https://rynekpierwotny.pl/deweloperzy/?page=1".

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, Spider


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1']
   
    
 
    def parse(self, response):
        for quote in response.css('.rp-y89gny.eboilu01 ul li'):
            yield {
                'address' : quote.css('address.rp-o9b83y::text').get(),
                'name' : quote.css('h2.rp-69f2r4::text').get(),
                'href' : quote.css('li.rp-np9kb1 a::attr(href)').get(),
                'PAGETEST' : response.css('a.rp-mmikj9::attr(href)').get()
            }
        next_page=response.css('a.rp-mmikj9::attr(href)').get()
        if next_page is not None:
            next_page_link=response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback= self.parse)
    
if __name__ == "__main__":
    process =CrawlerProcess()
    process.crawl(RynekMainSpider)
    process.start()

This version worked.
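
For reference, this is the pattern that cb_kwargs implements, and what the broken version above was attempting: every key in the cb_kwargs dict is passed to the callback as a keyword argument, so the callback signature has to accept it. A minimal sketch (the spider and field names are made up):

import scrapy
from scrapy import Request


class CbKwargsDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate how cb_kwargs is passed.
    name = "cbkwargs_demo"
    start_urls = ['https://rynekpierwotny.pl/deweloperzy/?page=1']

    def parse(self, response):
        # Each key in cb_kwargs becomes a keyword argument of the callback.
        yield Request(
            response.url,
            callback=self.parse_detail,
            cb_kwargs={'source_url': response.url},
            dont_filter=True,  # revisit the same URL for demo purposes
        )

    def parse_detail(self, response, source_url):
        # source_url arrives here because of cb_kwargs above.
        yield {'url': response.url, 'source_url': source_url}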


Answer from 北凤男飞, 2025-02-17 07:12:46:


Edit:

I made some further adjustments based on your notes about what you want the program to do. It should work the way you expect now.

As far as I can tell, the main problem was that response.css('div.root') matched nothing, because the container has the id root rather than a class, so parse never yielded any requests and parseMain was never called. On top of that, the trailing commas in parseMain turned each field into a one-element tuple, and the item was yielded outside the loop, so at best you would only get the last entry.

Try this instead:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1']

    def parse(self, response):
        # The listing container is div#root (an id), not div.root (a class).
        websites = response.css('div#root')[0]
        # href of the <li> that follows the current page's entry in the
        # pagination bar, i.e. the next page.
        PAGETEST = response.xpath('//a[contains(@class,"rp-173nt6g")]/../following-sibling::li').css('a::attr(href)').get()
        for website in websites.css('li.rp-np9kb1'):
            page = website.css('a::attr(href)').get()
            address = website.css('address.rp-o9b83y::text').get()
            name = website.css('h2.rp-69f2r4::text').get()
            # Collect the fields scraped from the listing page and hand
            # them to the detail-page callback via cb_kwargs.
            params = {
                'address': address,
                'name': name,
                'href': page,
            }
            url = response.urljoin(page)
            yield Request(url=url, cb_kwargs={'params': params}, callback=self.parseMain)
        # Follow the pagination link, if there is one (None on the last page).
        if PAGETEST is not None:
            yield Request(url=response.urljoin(PAGETEST), callback=self.parse)

    def parseMain(self, response, params=None):
        # Add the developer's website URL from the detail page, then
        # yield the completed item.
        website = response.css('div.rp-l0pkv6 a::attr(href)').get()
        params['website'] = website
        yield params


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(RynekMainSpider)
    process.start()
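
A note on the design: the params dict built in parse travels to parseMain through cb_kwargs, and the last field is filled in there before the finished item is yielded. Also, when the spider runs from a script like this, scraped items only show up in the log output; if you want them written to a file, you can pass feed-export settings to CrawlerProcess (FEEDS is a standard Scrapy setting, but the filename here is just an example):

from scrapy.crawler import CrawlerProcess

# Write every yielded item to items.json; csv and other formats also work.
process = CrawlerProcess(settings={
    'FEEDS': {'items.json': {'format': 'json'}},
})
process.crawl(RynekMainSpider)
process.start()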
