Python, BeautifulSoup or LXML - Parsing image URLs from HTML using CSS tags

Posted on 2024-10-04 04:24:22

I have searched high and low for a decent explanation of how BeautifulSoup or LXML work. Granted, their documentation is great, but for someone like myself, a Python/programming novice, it is difficult to decipher what I am looking for.

Anyway, as my first project, I am using Python to parse an RSS feed for post links; I have accomplished this with Feedparser. My plan is then to scrape each post's images. For the life of me, though, I cannot figure out how to get either BeautifulSoup or LXML to do what I want! I have spent hours reading through the documentation and googling to no avail, so I am here. The following is a line from the Big Picture (my scrapee).

<div class="bpBoth"><a name="photo2"></a><img src="http://inapcache.boston.com/universal/site_graphics/blogs/bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" style="height:1393px;width:990px" /><br/><div onclick="this.style.display='none'" class="noimghide" style="margin-top:-1393px;height:1393px;width:990px"></div><div class="bpCaption"><div class="photoNum"><a href="#photo2">2</a></div>In this photo released by China's Xinhua news agency, spectators watch an apartment building on fire in the downtown area of Shanghai on Monday Nov. 15, 2010. (AP Photo/Xinhua) <a href="#photo2">#</a><div class="cf"></div></div></div>

So, according to my understanding of the documentation, I should be able to pass the following:

soup.find("a", { "class" : "bpImage" })

to find all instances with that CSS class. Well, it doesn't return anything. I'm sure I'm overlooking something trivial, so I greatly appreciate your patience.

Thank you very much for your responses!

For future googlers, I'll include my feedparser code:

#!/usr/bin/env python

# RSS Feed Parser for the Big Picture Blog

import feedparser

# Import feed for parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")

# Print feed name
print(d['feed']['title'])

# Print the URL of every post in the feed
for entry in d.entries:
    print(entry.link)

Comments (3)

挽梦忆笙歌 2024-10-11 04:24:22

Using lxml, you might do something like this:

import feedparser
import lxml.html as lh

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

# Import feed for parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")

# Print feed name
print(d['feed']['title'])

# Visit each post and print the src of every <img class="bpImage">
for post in d['entries']:
    link = post['link']
    print('Parsing {0}'.format(link))
    doc = lh.parse(urlopen(link))
    imgs = doc.xpath('//img[@class="bpImage"]')
    for img in imgs:
        print(img.attrib['src'])
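
One caveat about the XPath above: @class="bpImage" compares the whole attribute value, so it would miss an element whose class attribute lists several classes. The sketch below shows a token-based test that may be safer in that case. The markup and the file names a.jpg, b.jpg and c.jpg are made up for illustration and are not taken from the Big Picture pages.

```python
import lxml.html as lh

# Hypothetical markup: one exact match, one multi-class match, one non-match.
html = (
    '<div>'
    '<img class="bpImage" src="a.jpg"/>'
    '<img class="bpImage wide" src="b.jpg"/>'
    '<img class="thumb" src="c.jpg"/>'
    '</div>'
)
doc = lh.fromstring(html)

# Exact comparison: only matches when the attribute is literally "bpImage".
exact = doc.xpath('//img[@class="bpImage"]/@src')

# Token comparison: pad the class list with spaces and look for " bpImage ".
token = doc.xpath(
    '//img[contains(concat(" ", normalize-space(@class), " "), " bpImage ")]/@src'
)

print(exact)
print(token)
```

For the sample line in the question the two expressions behave the same, because the img there carries only the one class.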
无声情话 2024-10-11 04:24:22

The code you posted looks for an a element with the bpImage class, but in your example the bpImage class is on the img element, not the a. You just need to do:

soup.find("img", { "class" : "bpImage" })
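
As a minimal sketch of how this might look end to end, the snippet below runs the corrected lookup against a shortened copy of the sample line from the question. It assumes the bs4 package (Beautiful Soup 4) and the built-in html.parser backend, neither of which is named in the thread. Note that find() returns only the first match, while find_all() returns every match, which is probably what is wanted when collecting all the images in a post.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question.
html = (
    '<div class="bpBoth"><a name="photo2"></a>'
    '<img src="http://inapcache.boston.com/universal/site_graphics/blogs/'
    'bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" '
    'style="height:1393px;width:990px" /><br/>'
    '<div class="bpCaption"><a href="#photo2">#</a></div></div>'
)

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first = soup.find("img", {"class": "bpImage"})

# find_all() returns a list of every matching tag.
image_urls = [img["src"] for img in soup.find_all("img", {"class": "bpImage"})]

# No <a> element carries the bpImage class, so the original call finds nothing.
missing = soup.find("a", {"class": "bpImage"})

print(first["src"])
print(image_urls)
print(missing)
```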
掩耳倾听 2024-10-11 04:24:22

Using pyparsing to search for tags is fairly intuitive:

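from pyparsing import makeHTMLTags, withAttribute

# makeHTMLTags returns (opening tag, closing tag) expressions; only the opening tag is needed
imgTag, notused = makeHTMLTags('img')

# only retrieve <img> tags with class='bpImage'
imgTag.setParseAction(withAttribute(**{'class': 'bpImage'}))

# html is assumed to hold the page source as a string
for img in imgTag.searchString(html):
    print(img.src)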