How to scrape a dynamically loaded page?

Posted on 2025-02-11 07:33:40

So here's my problem. I wrote a program that can get all of the information I want from the first page I load. But when I click the nextPage button, the site runs a script that loads the next batch of products without actually navigating to another page.

So on the next loop iteration I just get the same content as on the first one, even though the content shown in the browser I'm automating is different.

This is the code I run:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # assumed; the original post does not show how the driver was created
driver.get("https://www.my-website.com/search/results-34y1i")
soup = BeautifulSoup(driver.page_source, 'html.parser')  # parsed once, before the loop
time.sleep(2)

# ... code to find total number of pages (defines totalPages) ...
currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')

while currentPage != totalPages:
    # ... code to find the products ...
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)

Is there any way for me to scrape exactly what's loaded in my browser?

雄赳赳气昂昂 2025-02-18 07:33:40

The issue seems to be that you're only fetching page 1, as shown in the following line:

driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")

But as you can see, the URL has a query parameter called page that determines which page of results you fetch. So every time you loop to a new page, you have to fetch the new HTML with the driver by changing the page query parameter. For example, in your loop it would be something like this:

driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page = currentPage))
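If you would rather not hand-edit the long URL string, here is a minimal sketch of a helper that rewrites only the page parameter using the standard library. The name with_page is hypothetical (it is not part of Selenium), and it assumes the site really does paginate through a page query parameter as the URL above suggests.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, page: int) -> str:
    """Return `url` with its `page` query parameter set to `page`.

    Hypothetical helper: other query parameters are kept in their
    original order; `page` is appended if the URL does not have it yet.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "page" for key, _ in pairs):
        # Replace the existing value in place.
        pairs = [(key, str(page) if key == "page" else value)
                 for key, value in pairs]
    else:
        pairs.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

Inside the loop this would be called as driver.get(with_page(base_url, currentPage)), where base_url is whatever search URL you started from.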

After you fetch the new HTML, you'll be able to access the elements present on the different pages as required.
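Putting it together, the loop might look roughly like the sketch below. The important detail is that driver.page_source is read again after every navigation; HTML parsed once before the loop never changes, which would explain seeing the first page's content on every iteration. The function name scrape_pages, the {page} placeholder in url_template, and the parse callback are all assumptions made for illustration, and the fixed delay is a crude stand-in for a proper wait.

```python
import time


def scrape_pages(driver, url_template, total_pages, parse, delay=2.0):
    """Visit pages 1..total_pages and collect whatever `parse` extracts.

    driver       -- any object with .get(url) and .page_source,
                    e.g. a Selenium WebDriver
    url_template -- URL containing a literal "{page}" placeholder (assumed)
    parse        -- callable taking an HTML string and returning a list
    """
    results = []
    for page in range(1, total_pages + 1):
        driver.get(url_template.format(page=page))
        time.sleep(delay)  # crude wait for scripts to finish rendering
        html = driver.page_source  # re-read the HTML on every iteration
        results.extend(parse(html))
    return results

# Possible usage with Selenium and BeautifulSoup (".product" is a
# made-up selector; use whatever matches the real page):
#
#   from selenium import webdriver
#   from bs4 import BeautifulSoup
#
#   def parse(html):
#       soup = BeautifulSoup(html, "html.parser")
#       return [tag.get_text(strip=True) for tag in soup.select(".product")]
#
#   driver = webdriver.Chrome()
#   items = scrape_pages(driver, "https://example.com/search?page={page}", 3, parse)
```

The same idea applies if you keep clicking the next button instead of changing the URL: build a fresh BeautifulSoup object from driver.page_source after each click rather than reusing the one created before the loop.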
