Extracting an image link from a div and srcset using BS4

Posted on 2025-01-27 00:01:14

Example div tag within html:

[<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
                                https://img.example.image.link.here/954839 480w,
                                https://img.example.image.link.here/954839 600w,
                                https://img.example.image.link.here/954839 800w,
                                https://img.example.image.link.here/954839 1080w
                            ">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</source></picture>
</div>]

Desired outcome (srcset):

https://img.example.image.link.here/954839

My function:

def extract_img_link(html):
    with open(html, 'rb') as file:
        content = BeautifulSoup(file)
        for image in content.findAll('div', attrs={'class':'event-info-and-content'}):
            print(image.get("srcset"))
            return(image)

# calling out the html and function
html = 'data/website/events.html'
print(extract_img_link(html))

My function simply returns the entire tag I was looking for, rather than the specific link within it:

 [<div class="event-info-and-content">
    <picture content="https://img.example.image.link.here/954839">
    <source sizes="720px" srcset="
                                    https://img.example.image.link.here/954839 480w,
                                    https://img.example.image.link.here/954839 600w,
                                    https://img.example.image.link.here/954839 800w,
                                    https://img.example.image.link.here/954839 1080w
                                ">
    <img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
    </source></picture>
    </div>]

停滞 2025-02-03 00:01:14

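You forgot about the extra layer inside: the <picture> element is nested inside the <div>, and the link is an attribute of that inner element, not of the <div> itself.

The following worked for me.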

from bs4 import BeautifulSoup  

def extract_img_link(html):
    with open(html, 'rb') as file:
        content = BeautifulSoup(file, "html.parser")
        for image in content.find_all('div', attrs={'class':'event-info-and-content'}):
            for picture in image.find_all('picture'):
                print(picture["content"])
    
#calling out the html and function  
html = 'data/website/events.html'
extract_img_link(html)
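
Since the question asks for a function that hands back the link rather than printing it, here is a minimal sketch of a returning variant. It is only an illustration under some assumptions: the name extract_img_links and the inline sample string are hypothetical, and it takes the HTML as text instead of a file path so that it runs on its own.

```python
from bs4 import BeautifulSoup


def extract_img_links(html_text):
    """Return the content URL of every <picture> inside the target <div>."""
    soup = BeautifulSoup(html_text, "html.parser")
    pictures = soup.select("div.event-info-and-content picture")
    # .get() returns None when the attribute is missing, so skip those.
    return [p.get("content") for p in pictures if p.get("content")]


# Hypothetical inline sample, shortened from the markup in the question.
sample = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="https://img.example.image.link.here/954839 480w">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</picture>
</div>
'''

print(extract_img_links(sample))
```

Returning a list keeps the caller in charge of what to do when a page has several matching blocks, or none at all.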

疏忽 2025-02-03 00:01:14

To get the image path, change your selection and read the single URL from the content attribute of the <picture>:

for e in soup.select('div.event-info-and-content picture'):
    print(e.get('content'))

or take the first URL listed in the srcset attribute of the <source>:

for e in soup.select('div.event-info-and-content source'):
    print(e.get('srcset').split()[0])
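
Taking split()[0] is enough when every srcset entry points at the same file, as in the question. If the entries differ, one option is to split the attribute into (URL, descriptor) pairs and pick by width. The following is a rough sketch, not part of Beautiful Soup: parse_srcset and widest_url are hypothetical helper names, the /small and /large URLs are made up for the demonstration, and the simple comma split assumes the URLs themselves contain no commas.

```python
def parse_srcset(srcset):
    """Split a srcset value into (url, descriptor) pairs."""
    candidates = []
    for entry in srcset.split(","):
        parts = entry.split()
        if not parts:
            continue  # skip empty entries, e.g. after a trailing comma
        url = parts[0]
        descriptor = parts[1] if len(parts) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


def widest_url(srcset):
    """Return the URL with the largest width descriptor (e.g. '1080w'), or None."""
    best_url, best_width = None, -1
    for url, descriptor in parse_srcset(srcset):
        if descriptor.endswith("w") and descriptor[:-1].isdigit():
            width = int(descriptor[:-1])
            if width > best_width:
                best_url, best_width = url, width
    return best_url


# Hypothetical srcset value with two distinct (made-up) URLs.
demo_srcset = """
    https://img.example.image.link.here/small 480w,
    https://img.example.image.link.here/large 1080w
"""

print(parse_srcset(demo_srcset))
print(widest_url(demo_srcset))
```

The string passed in would be whatever e.get('srcset') returns for the <source> element.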

Example

from bs4 import BeautifulSoup

html = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
                                https://img.example.image.link.here/954839 480w,
                                https://img.example.image.link.here/954839 600w,
                                https://img.example.image.link.here/954839 800w,
                                https://img.example.image.link.here/954839 1080w
                            ">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</source></picture>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

for e in soup.select('div.event-info-and-content picture'):
    print(e.get('content'))

Output

https://img.example.image.link.here/954839