How do I scrape a dynamic page?
So here's my problem. I wrote a program that gets all of the information I want from the first page I load. But when I click the nextPage button, it runs a script that loads the next batch of products without actually navigating to another page.
So when the next loop iteration runs, I just get the same content as the first one, even though the content shown in the browser I'm automating is different.
This is the code I run:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup
    import time

    driver = webdriver.Chrome()  # not shown in the original post; Chrome is assumed
    driver.get("https://www.my-website.com/search/results-34y1i")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    time.sleep(2)
    # ... code to find the total number of pages (sets totalPages)
    currentPage = 0
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    while currentPage != totalPages:
        # ... code to find the products
        currentPage += 1
        button_NextPage = driver.find_element(By.ID, 'nextButton')
        button_NextPage.click()
        time.sleep(5)

Is there any way for me to scrape exactly what's loaded in my browser?
Comments (1)
The issue seems to be that you only ever fetch page 1, when you first load the URL with the driver.
As you can see, there is a query parameter called page in the URL that determines which page of HTML you are fetching. So every time you loop to a new page, you have to fetch the new HTML content with the driver by changing the page query parameter. After you fetch the new HTML, you will be able to access the elements that are present on the different pages as required.
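The loop described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names with_page and scrape_pages are hypothetical, the parameter is assumed to be literally named page and to count from 1, and driver is assumed to be any object with get() and page_source, such as a Selenium webdriver.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_page(url, page, param="page"):
    """Return url with its `param` query parameter set to `page` (hypothetical helper)."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # keep any other query parameters
    query[param] = str(page)
    return urlunparse(parts._replace(query=urlencode(query)))


def scrape_pages(driver, base_url, total_pages, parse):
    """Load each page through `driver` and parse its freshly fetched HTML."""
    results = []
    for page in range(1, total_pages + 1):  # assumes pages are numbered from 1
        driver.get(with_page(base_url, page))
        # Read page_source after every load so the parsed HTML is never stale.
        results.append(parse(driver.page_source))
    return results
```

With the code from the question, driver would be the existing webdriver instance and parse could be something like lambda html: BeautifulSoup(html, 'html.parser'). If the site turns out to have no such parameter and only updates the page through a script, the analogous step would be to rebuild the soup from driver.page_source after each click rather than once before the loop.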