Scrapy spider returns an empty results file

Posted on 2025-01-15 18:58:05


Just started learning Python, so sorry if this is a stupid question!

I'm trying to scrape real estate data from this website: https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=2&r=10 using Scrapy.

Ideally, in the end I'd get a file containing all available real estate offers and their respective address, price, area in m2, and other details (e.g. connection to public transport).

I built a test spider with Scrapy, but it always returns an empty file. I tried a whole bunch of different XPaths but can't get it to work. Can anyone help? Here's my code:

import scrapy

class GetdataSpider(scrapy.Spider):
  name = 'getdata' 
  allowed_domains = ['immoscout24.ch']
  start_urls = ['https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10',
  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=2&r=10',
  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=3&r=10',
  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=4&r=10',
  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=5&r=10']

  def parse(self, response):

    single_offer = response.xpath('//*[@class="Body-jQnOud bjiWLb"]')


    for offer in single_offer: 
        offer_price = offer.xpath('.//*[@class="Box-cYFBPY jPbvXR Heading-daBLVV dOtgYu xh-   highlight"]/text()').extract()
        offer_address = offer.xpath('.//*[@class="Address__AddressStyled-lnefMi fUIggX"]/text()').extract_first()

        yield {'Price': offer_price,
                'Address': offer_address}
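
A quick way to see why a selector comes back empty is Scrapy's interactive shell. The sketch below uses the first listing URL and the hashed class name from the code above; those class names come from the page as rendered in a browser and may not appear in the HTML the spider actually downloads.

# From a terminal, open the page in Scrapy's interactive shell:
#   scrapy shell "https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10"
# Then test the selector at the prompt:
response.xpath('//*[@class="Body-jQnOud bjiWLb"]')  # an empty list means it matches nothing in the downloaded HTML
view(response)  # shell helper: opens the HTML the spider actually received in a browser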


Comments (1)

唱一曲作罢 2025-01-22 18:58:05


First of all, you need to add a real user agent; I injected the user agent in the settings.py file. I also corrected the XPath selection and built the pagination directly into start_urls, which is about twice as fast as other pagination approaches such as following "next page" links. This is the working example.
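
For reference, the settings.py change mentioned above might look like the sketch below; the exact User-Agent string is only an example of a browser-like value, not something prescribed by the site or by Scrapy.

# settings.py -- send a browser-like User-Agent instead of Scrapy's default one
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
)

The corrected spider then follows.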

import scrapy

class GetdataSpider(scrapy.Spider):
    name = 'getdata'
    allowed_domains = ['immoscout24.ch']
    # Build the paginated result URLs up front (pn=1..4 here; widen the range for more pages).
    start_urls = ['https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=' + str(x) + '&r=10'
                  for x in range(1, 5)]

    def parse(self, response):
        # Each result card on the listing page.
        single_offer = response.xpath('//*[@class="Content-kCEgNG degSLr"]')

        for offer in single_offer:
            name = offer.xpath('.//*[@class="Box-cYFBPY jPbvXR Heading-daBLVV dOtgYu"]/span/text()').extract_first()
            offer_address = offer.xpath('.//*[@class="AddressLine__TextStyled-eaUAMD iBNjyG"]/text()').extract_first()

            yield {'title': name,
                   'Address': offer_address}

Output:

{'title': 'CHF 950.—', 'Address': 'Friedackerstrasse 6, 8050 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 220.—', 'Address': 'Freilagerstrasse 40, 8047 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 220.—', 'Address': 'Freilagerstrasse 40, 8047 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 220.—', 'Address': 'Freilagerstrasse 40, 8047 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 4020.—', 'Address': 'Uitikonerstrasse 9, 8952 Schlieren, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 260.—', 'Address': 'Buckhauserstrasse 45/49, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 260.—', 'Address': 'Letzigraben 75, 8003 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 2655.—', 'Address': 'Weiningerstr. 53, 8103 Unterengstringen, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 320.—', 'Address': 'Baslerstrasse 60, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 190.—', 'Address': 'Neugutstrasse 66, 8600 Dübendorf, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 180.—', 'Address': 'Herostrasse 9, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 180.—', 'Address': 'Herostrasse 9, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>

... and so on
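
To get the items into a file, which was the original goal, the spider can be run with Scrapy's feed exports, e.g. scrapy crawl getdata -o offers.csv, or a feed can be declared on the spider class itself. A minimal sketch of the latter, assuming Scrapy 2.1 or newer for the FEEDS setting and using offers.csv as an example filename:

# Added inside the GetdataSpider class from the answer above.
custom_settings = {
    'FEEDS': {
        'offers.csv': {'format': 'csv'},  # one CSV row per yielded item
    },
}

With this in place, the yielded dicts are written to offers.csv instead of only appearing in the crawl log.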
