Scraping listed HTML values with Scrapy
I can't seem to figure out how to construct this XPath selector. I have even tried using following-sibling::text() but to no avail. I have also browsed Stack Overflow questions about scraping listed values, but could not implement any of it correctly. I keep getting blank results. Any and all help would be appreciated. Thank you.
The website is https://www.unegui.mn/adv/5737502_10-r-khoroolold-1-oroo/.
Expected Results:
Wood
2015
Current Results:
blank
Current XPath Scrapy code:
list_li = response.xpath(".//ul[contains(@class, 'chars-column')]/li/text()").extract()
list_li = response.xpath("./ul[contains(@class,'value-chars')]//text()").extract()
floor_type = list_li[0].strip()
commission_year = list_li[1].strip()
HTML Snippet:
<div class="announcement-characteristics clearfix">
<ul class="chars-column">
<li class="">
<span class="key-chars">Flooring:</span>
<span class="value-chars">Wood</span></li>
<li class="">
<span class="key-chars">Commission year:</span>
<a href="https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/ashon_min---2011/"class="value-chars">2015</a>
</li>
</ul>
</div>
FURTHER CLARIFICATION:
I previously used two selectors (one for the span list, one for the a/href list), but the problem was that some pages on the website don't follow the same span-list/a-list order (i.e. on one page a table value would be in the span list, but on another page it would be in the a list). That is why I have been trying to use only one selector to get all the values.
This produces misaligned values. Instead of the number of windows (an integer) being scraped, the address gets scraped, because on some pages the table value is under the a list, not under the span list.
Previous 2 selectors:
list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()
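To see the failure mode concretely, here is a minimal sketch (the field layouts below are hypothetical, not taken from the site) of why fixed indices into the two lists break as soon as one field moves from the span column to the a column:

# Page A: 'garage' is rendered as a <span class="value-chars">
list_span = ['Wood', '2', 'Yes']   # flooring, balconies, garage
list_a    = ['2015', '5']          # commission year, floors

# Page B: 'garage' is rendered as an <a class="value-chars"> instead
list_span = ['Wood', '2']          # flooring, balconies
list_a    = ['Yes', '2015', '5']   # garage, commission year, floors

# Fixed positions now point at the wrong fields:
# list_span[2] raises IndexError, and list_a[0] is 'Yes', not a year.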
Whole Code (if someone needs it to test it):
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from datetime import datetime
from scrapy.crawler import CrawlerProcess
from selenium import webdriver

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' UB HPI Buying Data'

# create Spider class
class UneguiApartmentsSpider(scrapy.Spider):
    name = "unegui_apts"
    allowed_domains = ["www.unegui.mn"]
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    def __init__(self):
        self.driver = webdriver.Firefox()

    # function used for start url
    def start_requests(self):
        urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/ulan-bator/']
        for url in urls:
            yield Request(url, self.parse)

    def parse(self, response, **kwargs):
        cards = response.xpath("//li[contains(@class,'announcement-container')]")
        # parse details
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first().strip()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first().strip()
            rooms = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__breadcrumbs')]/span[2]/text())").extract_first().strip()
            link = card.xpath(".//a[@itemprop='url']/@href").extract_first().strip()
            date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
            date = date_block[0].strip()
            city = date_block[1].strip()
            item = {'name': name,
                    'date': date,
                    'rooms': rooms,
                    'price': price,
                    'city': city,
                    }
            # follow absolute link to scrape deeper level
            yield response.follow(link, callback=self.parse_item, meta={'item': item})
        # handling pagination
        next_page = response.xpath("//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
            print(f'Scraped {next_page}')

    def parse_item(self, response):
        # retrieve previously scraped item between callbacks
        item = response.meta['item']
        # parse additional details
        list_li = response.xpath(".//*[contains(@class, 'value-chars')]/text()").extract()
        # get additional details from the combined list, element by element
        floor_type = list_li[0].strip()
        num_balcony = list_li[1].strip()
        commission_year = list_li[2].strip()
        garage = list_li[3].strip()
        window_type = list_li[4].strip()
        num_floors = list_li[5].strip()
        door_type = list_li[6].strip()
        area_sqm = list_li[7].strip()
        floor = list_li[8].strip()
        leasing = list_li[9].strip()
        district = list_li[10].strip()
        num_window = list_li[11].strip()
        address = list_li[12].strip()
        # previous approach: separate <span> and <a> lists
        # list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
        # list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()
        # get additional details from list of <span> tags, element by element
        # floor_type = list_span[0].strip()
        # num_balcony = list_span[1].strip()
        # garage = list_span[2].strip()
        # window_type = list_span[3].strip()
        # door_type = list_span[4].strip()
        # num_window = list_span[5].strip()
        # get additional details from list of <a> tags, element by element
        # commission_year = list_a[0].strip()
        # num_floors = list_a[1].strip()
        # area_sqm = list_a[2].strip()
        # floor = list_a[3].strip()
        # leasing = list_a[4].strip()
        # district = list_a[5].strip()
        # address = list_a[6].strip()
        # update item with newly parsed data
        item.update({
            'district': district,
            'address': address,
            'area_sqm': area_sqm,
            'floor': floor,
            'commission_year': commission_year,
            'num_floors': num_floors,
            'num_windows': num_window,
            'num_balcony': num_balcony,
            'floor_type': floor_type,
            'window_type': window_type,
            'door_type': door_type,
            'garage': garage,
            'leasing': leasing
        })
        yield item

    def parse_item2(self, response):
        self.driver.get(response.url)
        while True:
            try:
                # the XPath must select an element (not a text node) for click()
                phone_button = self.driver.find_element_by_xpath("//span[contains(@class,'phone-author__title')]")
                phone_button.click()
                # get the data and write it to scrapy items
            except Exception:
                break
        self.driver.close()

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartmentsSpider)
    process.start()
3 Answers
You need two selectors: one to parse the keys and another to parse the values. This gives you two lists that can be zipped together to produce the results you are looking for.
The CSS selectors could look like:
Keys selector --> .chars-column li .key-chars
Values selector --> .chars-column li .value-chars
Once you extract both lists, you can zip them and consume them as key-value pairs.
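A minimal sketch of that approach (the class names and the labels 'Flooring' and 'Commission year' come from the question's HTML snippet; everything else is illustrative):

def parse_item(self, response):
    item = response.meta['item']
    # pull the label texts and the value texts in document order
    keys = response.css('.chars-column li .key-chars::text').getall()
    values = response.css('.chars-column li .value-chars::text').getall()
    # zip into a dict so the order of fields on a page no longer matters
    chars = {k.strip().rstrip(':'): v.strip() for k, v in zip(keys, values)}
    item['floor_type'] = chars.get('Flooring')               # 'Wood'
    item['commission_year'] = chars.get('Commission year')   # '2015'
    yield item

Because each field is looked up by its label rather than its position, pages where a value is rendered as an <a> instead of a <span> no longer shift every later field.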
I suppose that because of invalid HTML (some span elements are not closed), normal XPaths are not possible.
This did give me results:
response.xpath("//*[contains(@class, 'value-chars')]//text()").extract()
The * means any element, so it will select both span and a.
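Against the HTML snippet in the question, that single selector returns both values in document order, e.g. in scrapy shell:

>>> response.xpath("//*[contains(@class, 'value-chars')]//text()").extract()
['Wood', '2015']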
Use this XPath to get Wood:
//span[contains(text(), 'Flooring')]/following-sibling::span[@class='value-chars']/text()
Use this XPath to get 2015:
//span[contains(text(), 'Commission year')]/following-sibling::a[@class='value-chars']/text()
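If every field is needed, the same following-sibling idea generalizes to a small helper (a sketch; the label strings and the get_char name are assumptions based on the snippet in the question):

def get_char(response, label):
    # locate the key span by its label, then take the text of whatever
    # value element follows it, whether it is a <span> or an <a>
    return response.xpath(
        f"//span[@class='key-chars' and contains(text(), '{label}')]"
        f"/following-sibling::*[contains(@class, 'value-chars')]/text()"
    ).get(default='').strip()

floor_type = get_char(response, 'Flooring')              # 'Wood'
commission_year = get_char(response, 'Commission year')  # '2015'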