Empty results file using scrapy
Just started learning Python, so sorry if this is a stupid question!
I'm trying to scrape real estate data from this website: https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=2&r=10 using Scrapy.
Ideally, I'd end up with a file containing all available real estate offers with their respective address, price, area in m², and other details (e.g. public transport connections).
I built a test spider with Scrapy, but it always returns an empty file. I've tried a whole bunch of different XPaths but can't get it to work. Can anyone help? Here's my code:
import scrapy

class GetdataSpider(scrapy.Spider):
    name = 'getdata'
    allowed_domains = ['immoscout24.ch']
    start_urls = ['https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=2&r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=3&r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=4&r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=5&r=10']

    def parse(self, response):
        single_offer = response.xpath('//*[@class="Body-jQnOud bjiWLb"]')
        for offer in single_offer:
            offer_price = offer.xpath('.//*[@class="Box-cYFBPY jPbvXR Heading-daBLVV dOtgYu xh-highlight"]/text()').extract()
            offer_address = offer.xpath('.//*[@class="Address__AddressStyled-lnefMi fUIggX"]/text()').extract_first()
            yield {'Price': offer_price,
                   'Address': offer_address}
1 Answer
First of all, you need to add a real user agent; I set one in the settings.py file. I also corrected the XPath selection and generated the pagination directly in start_urls, which is roughly twice as fast as other next-page pagination approaches. This is the working example.
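(The answer's example code was not preserved in this copy. Below is a minimal sketch of the approach it describes: set a user agent in settings.py and build the paginated URLs up front instead of hard-coding them. The user-agent string, the page count, and the helper name `build_start_urls` are assumptions for illustration, and the real XPath class names still have to be read from the live page.)

```python
# settings.py — give the spider a real browser user agent, e.g.:
# USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'  # placeholder value

# Base listing URL taken from the question; ?pn= is the page parameter.
BASE = 'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich'

def build_start_urls(pages):
    """Return listing URLs for pages 1..pages (page count is an assumption)."""
    urls = [f'{BASE}?r=10']  # the first page carries no pn parameter
    urls += [f'{BASE}?pn={n}&r=10' for n in range(2, pages + 1)]
    return urls

# In the spider class, instead of five hand-written entries:
#     start_urls = build_start_urls(5)
```

This keeps the spider in sync when you want more pages: change one number instead of pasting more URLs.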
Output:
... so on