Selenium - iterating over paginated pages that contain an extra random number

Published 2025-02-09 16:27:57 · 1,201 characters · 2 views · 0 comments


The website I want to scrape is paginated, but I can't simply iterate over the page numbers, because every next page's URL contains an extra random number.

Here is the page :

https://market.bisnis.com/bursa-saham/2/20220621181040 (second page)
https://market.bisnis.com/bursa-saham/(page)/20220621181040

If I just change the (page) part, the result is a blank page. Here is my code, by the way. Thanks!

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import xlsxwriter

options = Options()
options.add_argument("start-maximized")
options.add_argument('--no-sandbox')

element_list = []

for page in range(1, 3):
    page_url = "https://market.bisnis.com/bursa-saham/" + str(page)
    driver = webdriver.Chrome("C:/Users/krish/Desktop/chromedriver_win32/chromedriver.exe", chrome_options=options)
    driver.get(page_url)
    titles = driver.find_elements(By.TAG_NAME, 'h2')

    for title in titles:
        element_list.append([title.text])

with xlsxwriter.Workbook('result2.xlsx') as workbook:
    worksheet = workbook.add_worksheet()

    for row_num, data in enumerate(element_list):
        worksheet.write_row(row_num, 0, data)

driver.close()
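One way around the unknown suffix, assuming the pagination links on the first page carry full next-page URLs (an assumption; check the page source), would be to read one such href, split the suffix out once, and reuse it for subsequent pages. The parsing step might look like this sketch:

```python
import re

def split_page_url(href):
    """Split a pagination href of the form .../bursa-saham/<page>/<suffix>
    into (page_number, suffix); return None if it does not match."""
    m = re.search(r"/bursa-saham/(\d+)/(\d+)$", href)
    return (int(m.group(1)), m.group(2)) if m else None

# e.g. the example URL from the question:
print(split_page_url("https://market.bisnis.com/bursa-saham/2/20220621181040"))
# → (2, '20220621181040')
```

With Selenium, the href would come from whatever anchor the site uses for its pagination; the locator for those anchors depends on the page's markup and is not shown here.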

Comments (1)

明月夜 2025-02-16 16:27:57


Instead of navigating to the next page by URL (the URL contains a date-and-time stamp which, I believe, you don't know in advance), try clicking the Next button:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_button = driver.find_element(By.ID, 'nextbtn')
next_button.click()
# wait until the old button reference goes stale, i.e. the next page has loaded
WebDriverWait(driver, 10).until(EC.staleness_of(next_button))

P.S. You'd also better move the

driver = webdriver.Chrome("C:/Users/krish/Desktop/chromedriver_win32/chromedriver.exe", chrome_options=options,)

line out of the loop, so that the same browser instance is reused for scraping all the pages.
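The two suggestions combined (one shared driver, click-based pagination) boil down to the traversal pattern below. The sketch keeps the loop itself browser-free by taking the Selenium calls as callables, so the Selenium wiring in the trailing comment is only an assumed illustration, not tested against the actual site:

```python
def paginate(get_items, find_next, go_next, max_pages=10):
    """Collect items page by page: scrape the current page, then follow
    Next until it is missing or max_pages is reached."""
    items = []
    for _ in range(max_pages):
        items.extend(get_items())
        nxt = find_next()        # None when there is no Next button
        if nxt is None:
            break
        go_next(nxt)             # in Selenium: click + wait for staleness
    return items

# Hypothetical Selenium wiring (driver created once, outside any loop):
#   paginate(
#       get_items=lambda: [h2.text for h2 in driver.find_elements(By.TAG_NAME, 'h2')],
#       find_next=lambda: next(iter(driver.find_elements(By.ID, 'nextbtn')), None),
#       go_next=lambda btn: (btn.click(),
#                            WebDriverWait(driver, 10).until(EC.staleness_of(btn))),
#   )
```

Because the three callables isolate all browser interaction, the loop can be exercised with plain stubs before pointing it at a real driver.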
