How do I get the href links from a tags on an HTML website using Beautiful Soup?

Posted on 2025-02-12 04:49:55 | Word count: 2147 | Views: 0 | Comments: 0

I am scraping this product page: https://www.hugoboss.com/us/interlock-cotton-t-shirt-with-exclusive-artwork/hbna50487153_739.html

I want the link for each color of this product from this HTML code: [image]

Current code:

    import numpy as np
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup as soup
    import requests

    driverfile = r'C:\Users\Main\Documents\Work\Projects\Scraping Websites\extra\chromedriver'

    #driver.implicitly_wait(10)

    url = "https://www.hugoboss.com/us/interlock-cotton-t-shirt-with-exclusive-artwork/hbna50487153_739.html"

    def make_soup(url):
        page = requests.get(url)
        page_soup = soup(page.content, 'lxml')
        return page_soup

    product_page_soup = make_soup(url)
    print(product_page_soup.select('a.slides__slide slides__slide--color-selector.js-slide.js- product-swatch.widget-initialized'))

Current output: an empty list, []

Expected output: the HTML of the a tag

FYI: Selecting another a tag on the same product page works, e.g. print(product_page_soup.select('a.dch-links-item.dch-links-item--released.dch-links-item--unstyled-selector.dch-links-item--bold--underscore.dch-links-item-tracking')[0].text.strip()). This outputs the desired text using the same method, so I am confused about why it does not work for the a tag in question: 'a.slides__slide slides__slide--color-selector.js-slide.js- product-swatch.widget-initialized'

I also tried product_page_soup.findAll('a', {"class": 'slides__slide.slides__slide--color-selector.js-slide.js-product-swatch.widget-initialized'}), but got the same empty list.


Comments (2)

秋凉 2025-02-19 04:49:55

The following CSS selector, used with bs4, will grab the desired links:

[class="stage__left-wrapper"] div nav a
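One detail of this selector may be worth spelling out: `[class="stage__left-wrapper"]` is an attribute selector, so it matches only when the whole class attribute is exactly that string, whereas `.stage__left-wrapper` matches any element that has that class among others. The sketch below illustrates the difference on a made-up HTML fragment; the markup and the `extra` class are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two wrappers, one of which carries an additional class.
html = """
<div class="stage__left-wrapper"><nav><a href="/a.html">A</a></nav></div>
<div class="stage__left-wrapper extra"><nav><a href="/b.html">B</a></nav></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: the class attribute must equal the string exactly.
exact = [a["href"] for a in soup.select('[class="stage__left-wrapper"] a')]

# Class selector: the element only needs to have this class among its classes.
by_class = [a["href"] for a in soup.select(".stage__left-wrapper a")]

print(exact)
print(by_class)
```

If the site ever adds a second class to the wrapper, the attribute form would stop matching, so the class form may be the more forgiving choice.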

Full working code:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Launch Chrome; webdriver_manager downloads a matching chromedriver.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.hugoboss.com/us/interlock-cotton-t-shirt-with-exclusive-artwork/hbna50487153_739.html')
driver.maximize_window()
time.sleep(5)  # crude wait so the page's JavaScript can finish rendering

# Parse the rendered DOM rather than the raw HTTP response.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for u in soup.select('[class="stage__left-wrapper"] div nav a'):
    # The hrefs are site-relative, so prepend the host.
    link = 'https://www.hugoboss.com' + u.get('href')
    print(link)

Output:

https://www.hugoboss.com/us/interlock-cotton-t-shirt-with-exclusive-artwork/hbna50487153_739.html
https://www.hugoboss.com/us/interlock-cotton-t-shirt-with-exclusive-artwork/hbna50487153_100.html
https://www.hugoboss.com/us/interlock-cotton-t-shirt-with-exclusive-artwork/hbna50487153_001.html
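The parsing step can also be separated from the browser, which makes it possible to try the selector against a saved copy of the page source. The following is a minimal sketch under that assumption: `extract_color_links` is a hypothetical helper name, the sample markup and its shortened paths are made up, and `urljoin` is swapped in for plain string concatenation so that hrefs that are already absolute are left alone.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_color_links(html, base_url="https://www.hugoboss.com"):
    """Return absolute URLs for the color links found in rendered page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select('[class="stage__left-wrapper"] div nav a'):
        href = a.get("href")
        if href:  # skip anchors that have no href attribute
            links.append(urljoin(base_url, href))
    return links


# Made-up fragment standing in for driver.page_source.
sample = """
<div class="stage__left-wrapper"><div><nav>
<a href="/us/shirt/hbna50487153_739.html">Blue</a>
<a href="https://www.hugoboss.com/us/shirt/hbna50487153_100.html">White</a>
<a>no href</a>
</nav></div></div>
"""

print(extract_color_links(sample))
```

With Selenium, the same helper would be called as `extract_color_links(driver.page_source)`.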

说谎友 2025-02-19 04:49:55

In the page source, the link has the class "widget". My guess is that it is replaced with "widget-initialized" after the page is rendered. So try

.widget

instead of

.widget-initialized

So the complete selector should be

a.slides__slide.slides__slide--color-selector.js-slide.js-product-swatch.widget

Also, for better readability, I would recommend using the CSS selector

'nav > a[data-as-click="productClick"]'
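To make the points above concrete, here is a small sketch run against a made-up fragment that imitates what the unrendered page source might look like; the markup, the shortened href, and the variable names are assumptions for illustration. It also shows why a space inside the selector matters: a space is a descendant combinator, so everything after it is read as a separate element nested inside the first one.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup: the link carries "widget", not "widget-initialized".
static_html = """
<nav>
  <a class="slides__slide slides__slide--color-selector js-slide js-product-swatch widget"
     data-as-click="productClick" href="/us/shirt/hbna50487153_739.html">Blue</a>
</nav>
"""
soup = BeautifulSoup(static_html, "html.parser")

# Class that only exists after rendering: nothing matches in the static source.
rendered_only = soup.select("a.js-product-swatch.widget-initialized")

# Classes chained with dots and no spaces, ending in "widget": matches the link.
static_class = soup.select(
    "a.slides__slide.slides__slide--color-selector.js-slide.js-product-swatch.widget"
)

# Shorter attribute-based selector suggested above.
by_attribute = soup.select('nav > a[data-as-click="productClick"]')

# A space turns the rest into a descendant element selector, so nothing matches.
spaced = soup.select("a.slides__slide slides__slide--color-selector.js-slide")

print(len(rendered_only), len(static_class), len(by_attribute), len(spaced))
```

Whether the real page source contains `widget` at request time would still need to be checked, for example by printing `page.text` from `requests` and searching for the class name.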