How to get Selenium to webscrape the second page of results in a popup window

Posted 2025-01-31 07:45:43

I am trying to webscrape several pages of results. The first page works fine, but when I switch to the next page it unfortunately just webscrapes the first page of results again. Moving to the next page doesn't return a new URL, so that approach doesn't work; the results appear in a window on top of the opened page. I also can't seem to figure out how to append the results of the first page to those of the second page; they come out as separate lists. Below is the code I have.

from selenium import webdriver
import time
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys

#original webscraping code to get the names of locations from page 1
url = r'https://autochek.africa/en/ng/fix-your-car/service/scheduled-car-service'
driver = webdriver.Chrome()
driver.get(url)
xpath_get_locations = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div/div/form/div[7]/div/label'

driver.find_element_by_xpath(xpath_get_locations).click()

soup = BeautifulSoup(driver.page_source, 'html.parser')


location_results = [i.text for i in soup.find_all('div', {'class': 'jsx-1642469937 state'})]

print(location_results)
time.sleep(3)

#finished page 1, finding the next button to go to page 2
xpath_find_next_button = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div[2]/div[2]/div/div/div[3]/ul/li[13]'
driver.find_element_by_xpath(xpath_find_next_button).click()

#getting the locations from page 2
second_page_results = [i.text for i in soup.find_all('div', {'class': 'jsx-1642469937 state'})]



print(second_page_results)
time.sleep(2)

Comments (1)

谎言月老 2025-02-07 07:45:43

After loading a new page, or after running some JavaScript code on the page, you have to run

soup = BeautifulSoup(driver.page_source, 'html.parser')

again to work with the new HTML.
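For example, the minimal fix to the question's code looks like this (a sketch: it assumes `from selenium.webdriver.common.by import By`, reuses the names from the question, and uses the Next-button XPath suggested further below):

xpath_find_next_button = '//li[@class="next-li"]/a'
driver.find_element(By.XPATH, xpath_find_next_button).click()
time.sleep(2)  # give the page time to render page 2

# re-parse the new page source; the old soup still holds page 1
soup = BeautifulSoup(driver.page_source, 'html.parser')
second_page_results = [i.text for i in soup.find_all('div', {'class': 'jsx-1642469937 state'})]

# extend the page-1 list instead of keeping separate lists
location_results.extend(second_page_results)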


Or skip BeautifulSoup and do it all in Selenium.

Use find_elements_... (with the letter s in the word elements) to get a list of matching elements.

items = driver.find_elements_by_xpath('//div[@class="jsx-1642469937 state"]')

location_result = [i.text for i in items]
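For current Selenium (4+), the equivalent non-deprecated call (used in the full code below, assuming `from selenium.webdriver.common.by import By`) is:

items = driver.find_elements(By.XPATH, '//div[@class="jsx-1642469937 state"]')
location_result = [i.text for i in items]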

By the way:

The XPath doesn't need the r prefix, because it doesn't use \.

A shorter and more readable XPath:

#xpath_get_locations = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div/div/form/div[7]/div/label'
xpath_get_locations = '//label[text()="Drop-off at Autochek location"]'

And it is simpler to use the Next > button instead of searching for buttons 2, 3, etc.:

xpath_find_next_button = '//li[@class="next-li"]/a'

EDIT:

Full working code, which uses a while loop to visit all pages.

I added the webdriver_manager module, which automatically downloads a (fresh) driver for the browser.

I use find_elements(By.XPATH, ...) because find_elements_by_xpath(...) is deprecated.

from selenium import webdriver
from selenium.webdriver.common.by import By
#from selenium.webdriver.common.keys import Keys
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException

import time
#from bs4 import BeautifulSoup

from webdriver_manager.chrome import ChromeDriverManager
#from webdriver_manager.firefox import GeckoDriverManager

driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
#driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
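# Note: in newer Selenium (4.6+) the `executable_path` argument was removed;
# use `webdriver.Chrome(service=Service(ChromeDriverManager().install()))`
# (with `from selenium.webdriver.chrome.service import Service`), or simply
# `webdriver.Chrome()` and let the built-in Selenium Manager fetch a driver.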

# ---

url = 'https://autochek.africa/en/ng/fix-your-car/service/scheduled-car-service'
driver.get(url)

#xpath_get_locations = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div/div/form/div[7]/div/label'
xpath_get_locations = '//label[text()="Drop-off at Autochek location"]'
driver.find_element(By.XPATH, xpath_get_locations).click()

# ---

all_locations = []

while True:
    
    # --- get locations on page
    
    time.sleep(1) # sometimes `JavaScript` may need time to add new items (and you can't catch it with `WebDriverWait`)

    #items = soup.find_all('div', {'class': 'jsx-1642469937 state'})
    items = driver.find_elements(By.XPATH, '//div[@class="jsx-1642469937 state"]')

    #soup = BeautifulSoup(driver.page_source, 'html.parser')

    locations = [i.text for i in items]

    print(locations)
    print('-------')

    all_locations += locations
    
    # --- find button `next >` and try to click it 
    
    #xpath_find_next_button = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div[2]/div[2]/div/div/div[3]/ul/li[13]'
    xpath_find_next_button = '//li[@class="next-li"]/a'

    try:
        driver.find_element(By.XPATH, xpath_find_next_button).click()
    except Exception:  # no `Next >` button - we are on the last page
        break  # exit loop
    
# ---

#driver.close()
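After the loop, all_locations holds the results from every page in a single list, which also answers the second part of the question. A short usage sketch (names as in the code above):

print(len(all_locations), 'locations in total')
print(all_locations)

driver.quit()  # close the browser when finished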