Unable to extract/load all hrefs from an iframe (inner HTML page) when parsing a webpage

Posted 2025-01-30 18:32:22


I am really struggling with this case and have been trying all day. Please, I need your help. I am trying to scrape this webpage: https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=
I want to get all 137 hrefs (137 documents).
The code I used:

    import time
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup
    from selenium.webdriver.common.keys import Keys

    def test(self):
        final_url = 'https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or='
        self.driver.get(final_url)
        # Find the iframe that holds the search results and build its absolute URL.
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        iframe = soup.find('iframe')
        src = iframe['src']
        base = 'https://decisions.scc-csc.ca/'
        main_url = urljoin(base, src)
        self.driver.get(main_url)
        # Scroll the page down a fixed number of times to trigger lazy loading.
        # (find_element_by_tag_name is the old Selenium 3 API.)
        browser = self.driver
        elem = browser.find_element_by_tag_name("body")
        no_of_pagedowns = 20
        while no_of_pagedowns:
            elem.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
            no_of_pagedowns -= 1

The problem is that it only loads the first 25 documents (hrefs): the page lazy-loads the remaining rows as you scroll, and the fixed 20 page-downs stop before the end. I don't know how to load the rest.
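A quick way to confirm the lazy-loading behaviour is to count the result links after the fixed scroll; a minimal sketch, assuming the same PDF-link selector used in the answer below:

    # Count how many result links are present after the 20 page-downs.
    # The selector is an assumption based on the page markup; on this page
    # the count stops at 25, showing the remaining rows load on scroll.
    links = browser.find_elements_by_css_selector('a[title="Download the PDF version"]')
    print(len(links))  # prints 25, not 137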

Comments (1)

孤独患者 2025-02-06 18:32:22

This code scrolls down until all the elements are visible, then saves the URLs of the PDFs in the list pdfs. Notice that all the work is done with Selenium, without using BeautifulSoup:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()  # configure as needed
driver = webdriver.Chrome(options=options, service=Service(your_chromedriver_path))
driver.get('https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=')

# Wait for the results iframe to be loaded, then switch into it.
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "decisia-iframe")))

# Read the total from the results header; in this case number_of_results = 137.
number_of_results = int(driver.find_element(By.XPATH, "//h2[contains(., 'result')]").text.split()[0])
pdfs = []

# Keep scrolling the last visible row into view until every result is loaded.
while len(pdfs) < number_of_results:
    pdfs = driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]')
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', pdfs[-1])
    time.sleep(1)

pdfs = [pdf.get_attribute('href') for pdf in pdfs]
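Once pdfs holds the 137 URLs, they can be downloaded, for example with requests; a minimal sketch, where the scc_pdfs output folder and the filename scheme are my assumptions:

import os
import requests

# Download each PDF collected above; assumes `pdfs` is the list of
# absolute URLs built by the loop. The folder name is an assumption.
os.makedirs('scc_pdfs', exist_ok=True)
for i, url in enumerate(pdfs):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Derive a filename from the URL path, falling back to an index.
    name = url.rstrip('/').split('/')[-1] or f'doc_{i}'
    if not name.endswith('.pdf'):
        name += '.pdf'
    with open(os.path.join('scc_pdfs', name), 'wb') as f:
        f.write(resp.content)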
