Python Selenium - How to scrape a URL from the src attribute using Selenium and Python

Posted on 2025-01-19 09:24:47

I'm trying to download a bunch of images and categorize them into folders using Selenium. To do so, I need to grab two IDs that appear in each image's URL. However, I'm having trouble scraping the image link from the src attribute. Whether I try to grab it by tag, XPath, or another method, the result is always "None".

Here's an example of an inspected image page:

<html style="height: 100%;">
    <head>
        <meta name="viewport" content="width=device-width, minimum-scale=0.1">
        <title>index.php (2448×3264)</title>
    </head>
    <body style="margin: 0px; background: #0e0e0e; height: 100%">
        <img style="-webkit-user-select: none;margin: auto;cursor: zoom-in;background-color: hsl(0, 0%, 90%);transition: background-color 300ms;" src="https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=LQCMY&fieldname=DT006_picture&p=show" width="444" height="593">
    </body>
</html>

For this example, I would need to grab "LQCMY" and "DT006_picture" as strings from the URL above. The code below shows my attempt at scraping the URL link (edited down since prior screens I click through are locked behind passwords that I can't give out).

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Image = '/html/body/div[1]/div[2]/div/table/tbody/tr[1]/td[1]/a'
driver.find_element_by_xpath(Image).click()
Image_URL = WebDriverWait(driver, 100).until(EC.element_to_be_clickable((By.XPATH, Image))).get_attribute('src')
print(Image_URL)

Are there certain src values that can't be scraped, or am I scraping the wrong tag?

I've tried grabbing by tag, but that also returns "None".

Image_URL = driver.find_element_by_xpath(Image).get_attribute('src')

Other posts said WebDriverWait would help, but I've tried adjusting the wait time and I'm still getting "None".

Comments (1)

爱你是孤单的心事 2025-01-26 09:24:47

To print the value of the src attribute you can use either of the following locator strategies:

  • Using css_selector:

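    # find_element_by_css_selector() was removed in Selenium 4.3; find_element(By.CSS_SELECTOR, ...) works in Selenium 3 and 4
    print(driver.find_element(By.CSS_SELECTOR, "body img[style*='webkit-user-select'][src^='https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=']").get_attribute("src"))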
    
  • Using xpath:

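    # find_element_by_xpath() was removed in Selenium 4.3; find_element(By.XPATH, ...) works in Selenium 3 and 4
    print(driver.find_element(By.XPATH, "//body//img[contains(@style, 'webkit-user-select') and starts-with(@src, 'https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=')]").get_attribute("src"))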
    

Ideally, you should use WebDriverWait with visibility_of_element_located() so that the image is visible before its attribute is read. You can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "body img[style*='webkit-user-select'][src^='https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=']"))).get_attribute("src"))
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body//img[contains(@style, 'webkit-user-select') and starts-with(@src, 'https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=')]"))).get_attribute("src"))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in Python Selenium - get href value
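
Once the src value has been retrieved, the two IDs the question asks for ("LQCMY" and "DT006_picture") can be pulled out of the query string with the standard library. The following is a minimal sketch, assuming the URL always carries `id` and `fieldname` parameters as in the example. The helper names `extract_ids` and `destination_for`, the `downloads` root folder, and the `.jpg` extension are hypothetical and do not come from the original post.

```python
from pathlib import Path
from urllib.parse import urlparse, parse_qs


def extract_ids(src_url):
    """Return the (id, fieldname) query parameters of an image URL."""
    query = parse_qs(urlparse(src_url).query)
    # parse_qs maps each key to a list of values; take the first of each.
    return query["id"][0], query["fieldname"][0]


def destination_for(src_url, root="downloads"):
    """Hypothetical folder layout: <root>/<fieldname>/<id>.jpg (extension is a guess)."""
    image_id, fieldname = extract_ids(src_url)
    return Path(root) / fieldname / f"{image_id}.jpg"


# The src value from the question, as get_attribute("src") would return it.
src = "https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=LQCMY&fieldname=DT006_picture&p=show"

image_id, fieldname = extract_ids(src)
print(image_id, fieldname)             # LQCMY DT006_picture
print(destination_for(src).as_posix())  # downloads/DT006_picture/LQCMY.jpg
```

A possible reason for the "None" in the question is that its XPath ends in `/a`: an anchor element normally carries an `href` attribute rather than `src`, so `get_attribute('src')` has nothing to return. Locating the `img` element, as the selectors above do, avoids that.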
