Clicking multiple divs with the same class name using a loop

Posted on 2025-02-13 00:41:49

I'm trying to click on multiple divs with the same class name, parse the HTML page, extract some information, and get back to the same page. On this page, I want to:

  1. Select an item and extract the relevant information.
  2. Go back to the same page.
  3. Click on the next item.

This works perfectly outside the for loop.

WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH,'//*[@class="product__wrapper"][1]'))).click()

But when I use the above command inside my loop, it throws an InvalidSelectorException.

for i in range(1,len(all_profile_url)):
        
        
        WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH,'//*[@class="product__wrapper"][{i}]'))).click()
        time.sleep(10)
        wd.execute_script('window.scrollTo(0,1000)')
        
        page_source = BeautifulSoup(wd.page_source, 'html.parser')

        info_div = page_source.find('div', class_='ProductInfoCard__Breadcrumb-sc-113r60q-4 cfIqZP')

        info_block = info_div.find_all('a')
        try:
            info_category = info_block[1].get_text().strip()
        except IndexError:
            info_category ="Null"
        wd.back()
        time.sleep(5)
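
Note: one likely cause of the InvalidSelectorException is that the XPath above is a plain string rather than an f-string, so the literal characters {i} reach the driver and make the selector invalid. A minimal sketch of the interpolated form (everything else in the loop unchanged):

WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, f'//*[@class="product__wrapper"][{i}]'))).click()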

What I want to extract from each page, using the code below:

page_source = BeautifulSoup(wd.page_source, 'html.parser')

info_div = page_source.find('div', class_='ProductInfoCard__Breadcrumb-sc-113r60q-4 cfIqZP')

info_block = info_div.find_all('a')
try:
    info_category = info_block[1].get_text().strip()
except IndexError:
    info_category ="Null"

try:
    info_sub_category = info_block[2].get_text().strip()
except IndexError:
    info_sub_category='Null'

try:
    info_product_name = info_div.find_all('span')[0].get_text().strip()
except IndexError:
    info_product_name='null'


# Extract Brand name
info_div_1 = page_source.find('div', class_='ProductInfoCard__BrandContainer-sc-113r60q-9 exyKqL')
try:
    info_brand = info_div_1.find_all('a')[0].get_text().strip()
except IndexError:
    info_brand='null'


# Extract details for rest of the page
info_div_2 = page_source.find('div', class_='ProductDetails__RemoveMaxHeight-sc-z5f4ag-3 fOPLcr')
info_block_2 = info_div_2.find_all('div', class_='ProductAttribute__ProductAttributesDescription-sc-dyoysr-2 lnLDYa')
try:
    info_shelf_life = info_block_2[0].get_text().strip()
except IndexError:
    info_shelf_life = 'null'

try:
    info_country_of_origin = info_block_2[3].get_text().strip()
except IndexError:
    info_country_of_origin='null'

try:
    info_weight = info_block_2[9].get_text().strip()
except IndexError:
    info_weight ='null'

try:
    info_expiry_date = info_block_2[7].get_text().strip()
except IndexError:
    info_expiry_date='null'
# Extract MRP and price
info_div_3 = page_source.find('div', class_='ProductVariants__VariantDetailsContainer-sc-1unev4j-7 fvkqJd')
info_block_3 = info_div_3.find_all('div', class_='ProductVariants__PriceContainer-sc-1unev4j-9 jjiIua')
info_price_raw = info_block_3[0].get_text().strip()
info_price = info_price_raw[1:3]   # characters 1-2 of the price string (assumes a two-digit price after the currency symbol)
info_MRP = info_price_raw[-2:]     # last two characters (assumes a two-digit MRP)


Answer by 柳絮泡泡, 2025-02-20 00:41:49


We don't need to use BeautifulSoup to parse the data. Selenium has methods that will be sufficient for our use case.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import pandas as pd
    

chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
url = 'https://blinkit.com/cn/masala-oil-more/whole-spices/cid/1557/930'
driver = webdriver.Chrome(service=s)
driver.get(url)

click_location_tooltip = driver.find_element(by=By.XPATH, value="//button[@data-test-id='address-correct-btn']")
click_location_tooltip.click()

cards_elements_list = driver.find_elements(by=By.XPATH, value="//a[@data-test-id='plp-product']")
card_link_list = [x.get_attribute('href') for x in cards_elements_list]

df = pd.DataFrame(columns=['info_category','info_sub_category','info_product_name','info_brand','info_shelf_life','info_country_of_origin','info_weight','info_expiry_date','price','mrp'])

for url in card_link_list:
  driver.get(url)
  try:
      WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, 'ProductInfoCard__BreadcrumbLink-sc-113r60q-5')))
  except TimeoutException:
      print(url + ' cannot be loaded')
      continue
  bread_crumb_links = driver.find_elements(by=By.XPATH, value="//a[@class='ProductInfoCard__BreadcrumbLink-sc-113r60q-5 hRvdxN']")
  info_category = bread_crumb_links[1].text.strip()
  info_sub_category = bread_crumb_links[2].text.strip()

  product_name = driver.find_element(by=By.XPATH, value="//span[@class='ProductInfoCard__BreadcrumbProductName-sc-113r60q-6 lhxiqc']")
  info_product_name = product_name.text

  brand_name = driver.find_element(by=By.XPATH, value="//div[@class='ProductInfoCard__BrandContainer-sc-113r60q-9 exyKqL']")
  info_brand = brand_name.text

  product_details = driver.find_elements(by=By.XPATH, value="//div[@class='ProductAttribute__ProductAttributesDescription-sc-dyoysr-2 lnLDYa']")
  info_shelf_life = product_details[0].text.strip()
  info_country_of_origin = product_details[1].text.strip()
  info_weight = product_details[7].text.strip()
  info_expiry_date = product_details[5].text.strip()

  # Walk from the radio-button marker up to its enclosing variant card; the price container sits inside that card.
  div_containing_radio = driver.find_element(by=By.XPATH, value="//div[starts-with(@class, 'ProductVariants__RadioButtonInner')]//ancestor::div[starts-with(@class, 'ProductVariants__VariantCard')]")

  price_mrp_div = div_containing_radio.find_element(by=By.CSS_SELECTOR, value=".ProductVariants__PriceContainer-sc-1unev4j-9.jjiIua")
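  # The price container text has a '₹' before each amount; splitting on '₹' gives ['', price, MRP] when both are shown.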
  mrp_price_list = price_mrp_div.text.split("₹")
  price = mrp_price_list[1]
  mrp = ''
  if(len(mrp_price_list) > 2):
    mrp = mrp_price_list[2]

  data_dict = {'info_category' : info_category, 'info_sub_category' : info_sub_category, 'info_product_name' : info_product_name, 'info_brand' : info_brand, 'info_shelf_life' : info_shelf_life, 'info_country_of_origin': info_country_of_origin, 'info_weight' : info_weight, 'info_expiry_date' : info_expiry_date , 'price' : price, 'mrp' : mrp}
  df_dict = pd.DataFrame([data_dict])
  df = pd.concat([df, df_dict])

Output: (screenshot of the resulting DataFrame omitted)
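
If you want to keep the collected rows afterwards, one option (not part of the original answer; the CSV file name is just illustrative) is to reset the index and write the frame to disk:

df = df.reset_index(drop=True)
print(df.head())
df.to_csv('blinkit_products.csv', index=False)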

P.S.: Note that product_details is not really a structured element, just text that would need to be parsed (for example with regex) to generalize across all URLs, so you will have to do some exception handling when indexing the product_details list, as you already do in your code.
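
A minimal sketch of that kind of defensive indexing, using a small helper (the helper name safe_text and the 'null' fallback are just illustrative, mirroring the try/except pattern in your code):

def safe_text(elements, index, default='null'):
    # Return the stripped text of elements[index], or the default when the index is out of range.
    try:
        return elements[index].text.strip()
    except IndexError:
        return default

info_shelf_life = safe_text(product_details, 0)
info_country_of_origin = safe_text(product_details, 1)
info_weight = safe_text(product_details, 7)
info_expiry_date = safe_text(product_details, 5)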
