Python beautifulsoup srcset

Extracting an image link from a div and srcset using BS4

Posted on 2025-01-27 00:01:14

Example div tag within the HTML:

[<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
                                https://img.example.image.link.here/954839 480w,
                                https://img.example.image.link.here/954839 600w,
                                https://img.example.image.link.here/954839 800w,
                                https://img.example.image.link.here/954839 1080w
                            ">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</source></picture>
</div>]

Desired outcome (srcset):

https://img.example.image.link.here/954839

My function:

from bs4 import BeautifulSoup

def extract_img_link(html):
    with open(html, 'rb') as file:
        content = BeautifulSoup(file)
        for image in content.findAll('div', attrs={'class':'event-info-and-content'}):
            print(image.get("srcset"))
            return(image)

#calling out the html and function
html = 'data/website/events.html'
print(extract_img_link(html))

My function simply returns the entire tag I was looking for, rather than the specific link within:

 [<div class="event-info-and-content">
    <picture content="https://img.example.image.link.here/954839">
    <source sizes="720px" srcset="
                                    https://img.example.image.link.here/954839 480w,
                                    https://img.example.image.link.here/954839 600w,
                                    https://img.example.image.link.here/954839 800w,
                                    https://img.example.image.link.here/954839 1080w
                                ">
    <img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
    </source></picture>
    </div>]

Comments (2)

停滞 2025-02-03 00:01:14

You forgot about an extra layer inside, namely the picture inside the div.

The following worked for me.

from bs4 import BeautifulSoup  

def extract_img_link(html):
    with open(html, 'rb') as file:
        content = BeautifulSoup(file, "html.parser")
        for image in content.find_all('div', attrs={'class':'event-info-and-content'}):
            for picture in image.find_all('picture'):
                print(picture["content"])
    
#calling out the html and function  
html = 'data/website/events.html'
extract_img_link(html)
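
As a minimal sketch of the same idea, the variant below collects the links in a list and returns them instead of printing, which is closer to what the question's function was trying to do. The name `extract_img_links` and the choice to pass the markup as a string (rather than a file path) are assumptions made here so the example is self-contained.

```python
from bs4 import BeautifulSoup

def extract_img_links(markup):
    """Return the `content` URL of every <picture> inside a matching div."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for div in soup.find_all('div', attrs={'class': 'event-info-and-content'}):
        for picture in div.find_all('picture'):
            link = picture.get('content')  # None if the attribute is missing
            if link:
                links.append(link)
    return links

# Sample markup taken from the question (srcset shortened)
sample = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="https://img.example.image.link.here/954839 480w">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</picture>
</div>
'''

links = extract_img_links(sample)
print(links)
```

Using `picture.get('content')` rather than `picture["content"]` avoids a `KeyError` when a `<picture>` has no `content` attribute.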

疏忽 2025-02-03 00:01:14

To get the image path, change your selection and use the single URL from the <picture>:

for e in soup.select('div.event-info-and-content picture'):
    print(e.get('content'))

or the first URL from the <source>:

for e in soup.select('div.event-info-and-content source'):
    print(e.get('srcset').split()[0])

Example

from bs4 import BeautifulSoup

html = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
                                https://img.example.image.link.here/954839 480w,
                                https://img.example.image.link.here/954839 600w,
                                https://img.example.image.link.here/954839 800w,
                                https://img.example.image.link.here/954839 1080w
                            ">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</source></picture>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

for e in soup.select('div.event-info-and-content picture'):
    print(e.get('content'))

Output

https://img.example.image.link.here/954839
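
If the URLs in a srcset differ per width, taking `split()[0]` only ever yields the first candidate. The sketch below splits the attribute into (URL, width) pairs and picks the widest one. The helper name `parse_srcset` is hypothetical, and splitting on commas is a simplifying assumption: it would break on URLs that themselves contain commas.

```python
from bs4 import BeautifulSoup

html = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
    https://img.example.image.link.here/954839 480w,
    https://img.example.image.link.here/954839 1080w
">
</picture>
</div>
'''

def parse_srcset(srcset):
    """Split a srcset value into (url, width) pairs; width is 0 if absent."""
    candidates = []
    for entry in srcset.split(','):  # naive: assumes no commas inside URLs
        parts = entry.split()
        if not parts:
            continue
        url = parts[0]
        width = 0
        # A width descriptor looks like "480w"
        if len(parts) > 1 and parts[1].endswith('w') and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        candidates.append((url, width))
    return candidates

soup = BeautifulSoup(html, 'html.parser')
source = soup.select_one('div.event-info-and-content source')
candidates = parse_srcset(source.get('srcset', ''))

# Pick the candidate with the largest width descriptor
largest_url, largest_width = max(candidates, key=lambda c: c[1])
print(largest_url, largest_width)
```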
