Scraping dynamic content with Beautiful Soup

Posted 2025-02-03 09:48:16


To train my Python skills I tried to scrape the number of open jobs for a specific given job from the web presence of the "Arbeitsagentur" (https://www.arbeitsagentur.de/jobsuche/). I used the web developer inspection tool of the Firefox browser to extract the text out of the item containing the information, e.g. "12.231 Jobs für Informatiker/in". My code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait

content = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin"
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(executable_path="C:/Drivers/geckodriver/geckodriver.exe", options=options)
driver.get(content)
soup = BeautifulSoup(driver.page_source, 'html.parser')
num_jobs = soup.select_one('div[class="h1-zeile-suche-speichern-container-content container-fluid"] h2')
print(num_jobs)
driver.close()

As a result I get the correct line extracted, but it does not include the queried information. Translated into English, I get this output:

<h2 _ngcontent-serverapp-c39="" class="h6" id="suchergebnis-h1-anzeige">Jobs for Informatiker/in are loaded</h2>

In the Firefox web inspector I see instead:

<h2 id="suchergebnis-h1-anzeige" class="h6" _ngcontent-serverapp-c39="">
12.231 Jobs für Informatiker/in</h2>

I tried the WebDriverWait method and driver.implicitly_wait() to wait until the webpage has loaded completely, but without success.
Probably this value is calculated and inserted by a JS script(?). As I am not a web developer, I don't know how this works and how to correctly extract the line with the given number of jobs. I tried to use the debugger of the Firefox developer tools to see where/how the value is calculated, but most scripts are only very cryptic one-liners.
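
For reference, this is roughly how my attempt with WebDriverWait looked (a sketch; the element id is taken from the inspector output above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# the <h2> is found immediately, apparently before the script fills in the count
wait.until(EC.presence_of_element_located((By.ID, "suchergebnis-h1-anzeige")))
soup = BeautifulSoup(driver.page_source, 'html.parser')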

(Extracting the number/value out of the string by means of a regular expression will be no problem at all.)
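
For completeness, the kind of extraction I have in mind (a quick sketch, using the example string from the inspector):

import re

text = "12.231 Jobs für Informatiker/in"
match = re.search(r"[\d.]+", text)
if match:
    # strip the German thousands separator before converting
    print(int(match.group().replace(".", "")))  # 12231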

I really would appreciate your support or any useful hint.

Comments (2)

嗫嚅 2025-02-10 09:48:16


Since the contents are dynamically loaded, you can parse the number of job results only after a certain element is visible; at that point all elements will have been loaded and you can successfully parse your desired data.

You can also increase the sleep time so that all data gets loaded, but that is a bad solution.

Working code -

import time

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()

# options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-extensions")

chrome_driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)


def arbeitsagentur_scraper():
    URL = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin"
    with chrome_driver as driver:
        driver.implicitly_wait(15)
        driver.get(URL)
        wait = WebDriverWait(driver, 10)
        
        # time.sleep(10)  # increase the load time to fetch all elements; not an advised solution
       
        # wait until this element is visible 
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.liste-container')))
        
        elem = driver.find_element(By.XPATH,
                                   '/html/body/jb-root/main/jb-jobsuche/jb-jobsuche-suche/div[1]/div/jb-h1zeile/h2')
        print(elem.text)


arbeitsagentur_scraper()

Output -

12.165 Jobs für Informatiker/in
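
The same approach should also work with your original Firefox/geckodriver setup; a minimal sketch (assuming Selenium 4's Service API and the geckodriver path from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

options = Options()
options.add_argument("--headless")
service = Service("C:/Drivers/geckodriver/geckodriver.exe")  # path from the question

with webdriver.Firefox(service=service, options=options) as driver:
    driver.get("https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin")
    # wait for the result list; by then the heading holds the job count
    WebDriverWait(driver, 15).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, ".liste-container"))
    )
    print(driver.find_element(By.ID, "suchergebnis-h1-anzeige").text)
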
2025-02-10 09:48:16


Alternatively, you can use their API URL to load the results. For example:

import json
import requests


api_url = "https://rest.arbeitsagentur.de/jobboerse/jobsuche-service/pc/v4/jobs"

query = {
    "angebotsart": "1",
    "was": "Informatiker/in",
    "page": "1",
    "size": "25",
    "pav": "false",
}

headers = {
    # short-lived OAuth token, presumably captured from the site's own requests;
    # if the call returns 401/403, grab a fresh one from the browser's network tab
    "OAuthAccessToken": "eyJhbGciOiJIUzUxMiJ9.eyAic3ViIjogIklkNFZSNmJoZFpKSjgwQ2VsbHk4MHI4YWpkMD0iLCAiaXNzIjogIk9BRyIsICJpYXQiOiAxNjU0MDM2ODQ1LCAiZXhwIjogMS42NTQwNDA0NDVFOSwgImF1ZCI6IFsgIk9BRyIgXSwgIm9hdXRoLnNjb3BlcyI6ICJhcG9rX21ldGFzdWdnZXN0LCBqb2Jib2Vyc2Vfc3VnZ2VzdC1zZXJ2aWNlLCBhYXMsIGpvYmJvZXJzZV9rYXRhbG9nZS1zZXJ2aWNlLCBqb2Jib2Vyc2Vfam9ic3VjaGUtc2VydmljZSwgaGVhZGVyZm9vdGVyX2hmLCBhcG9rX2hmLCBqb2Jib2Vyc2VfcHJvZmlsLXNlcnZpY2UiLCAib2F1dGguY2xpZW50X2lkIjogImRjZGVhY2JkLTJiNjItNDI2MS1hMWZhLWQ3MjAyYjU3OTg0OCIgfQ.BBkJbJ93fGqQQQGX4-VTzX8P6Twg8Rthq8meXV2WY_CoUmXQWhdgbjkFozP2BJXooSr7yLaTJr7JXEk8hDnCWA",
}

data = requests.get(api_url, params=query, headers=headers).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

print(data["maxErgebnisse"])

Prints:

12165