How to get an href from a class containing specific text with a CSS selector (Scrapy)

Posted 2025-01-09 10:40:18


I am working with the following website: https://inmuebles.mercadolibre.com.mx/venta/, and I am trying to get the link from the "Ver todos" button in the "Inmueble" section (in red). However, the "Tour virtual" and "Publicados hoy" sections (in blue) may or may not appear when visiting the site.

[screenshot: filter sidebar, with the "Inmueble" section marked in red and the "Tour virtual" / "Publicados hoy" sections marked in blue]

As shown in the image below, each ui-search-filter-dl class wraps one section of the menu from the image above, while the ui-search-filter-container classes contain the sub-sections displayed by the site (e.g. Casas, Departamento & Terrenos for Inmueble). To obtain the link from the "Ver todos" button in the "Inmueble" section, I was using this line of code:

ver_todos = response.css('div.ui-search-filter-dl')[2].css('a.ui-search-modal__link').attrib['href']

But since "Tour virtual" and "Publicados hoy" are not always on the page, I cannot be sure that the ui-search-filter-dl at index 2 always corresponds to the section with the "Ver todos" button.

[screenshot: page source showing the ui-search-filter-dl and ui-search-filter-container elements]

I was trying to get the link from "ver todos" by using this line of code:

response.css(''':contains("Inmueble") ~ .ui-search-filter-dt-title
                            .ui-search-modal__link::attr(href)''').extract()

Basically, I was trying to get the href from the ui-search-filter-dt-title class that contains the title "Inmueble". Unfortunately, the output is an empty list. I would like to find the "Ver todos" link using CSS and regex, but I'm having trouble with it. How can I achieve that?

Comments (2)

放血 2025-01-16 10:40:18


I think XPath makes it easier to select the target elements in most cases:

Code:

xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(@class,'ui-search-modal__link')]/@href"
url = response.xpath(xpath).extract()[0]

I didn't actually create a Scrapy project to check your code; instead, I verified the XPath with the following code:

from lxml import html
import requests

res = requests.get("https://inmuebles.mercadolibre.com.mx/venta/")
dom = html.fromstring(res.text)

# Anchor on the section title text, then follow the sibling <ul> to the link.
xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(@class,'ui-search-modal__link')]/@href"
url = dom.xpath(xpath)[0]

assert url == 'https://inmuebles.mercadolibre.com.mx/venta/_FiltersAvailableSidebar?filter=PROPERTY_TYPE'

Since the XPath is the same for Scrapy and lxml, I expect the code shown at the beginning to also work fine in your Scrapy project.

万水千山粽是情ミ 2025-01-16 10:40:18


An easy way to do it is to get all the links (<a>) and then check whether their text matches "ver todos".

import requests
from bs4 import BeautifulSoup

link = "https://inmuebles.mercadolibre.com.mx/venta/"

def main():
  res = requests.get(link)
  if res.status_code == 200:
    soup = BeautifulSoup(res.text, "html.parser")
    # Keep only anchors whose visible text is exactly "ver todos" (case-insensitive).
    links = [a["href"] for a in soup.select("a") if a.text.strip().lower() == "ver todos"]
    print(links)


if __name__ == "__main__":
  main()
