4.3 The Powerful BeautifulSoup

Published 2024-01-26 22:39:51

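Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. In Python crawler development we mainly use its search and extraction features and rarely modify documents. The rest of this section introduces Beautiful Soup step by step, from basic to more advanced usage.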

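4.3.1 Installing BeautifulSoup

We recommend Beautiful Soup 4 (BS4). Beautiful Soup 3 is no longer being developed, and the project has moved to BS4. There are three ways to install Beautiful Soup 4:

·On a recent Debian or Ubuntu release, you can install it with the system package manager: apt-get install python-bs4.

·Beautiful Soup 4 is published on PyPI, so it can be installed with easy_install or pip. The package name is beautifulsoup4, and the same package works on both Python 2 and Python 3. Install it with easy_install beautifulsoup4 or pip install beautifulsoup4.

·You can also install from source. The latest version at the time of writing is 4.5.1, and the source can be downloaded from https://pypi.python.org/pypi/beautifulsoup4/ . Run the following command to complete the installation: python setup.py install.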

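Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Because lxml parses much faster than the standard-library HTML parser, we install lxml as the parser to use. Depending on your operating system, install lxml with one of the following commands:

·apt-get install python-lxml

·easy_install lxml

·pip install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a browser does. Install it with one of the following commands:

·apt-get install python-html5lib

·easy_install html5lib

·pip install html5lib

Table 4-9 lists the main parsers along with their advantages and disadvantages.

Table 4-9 Parser comparison

Table 4-9 shows why lxml is the recommended parser: it is more efficient.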
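
To make the choice of parser concrete, here is a minimal sketch in Python 3 syntax that prefers lxml and falls back to html5lib and then the standard-library parser. The helper name make_soup and the preference order are assumptions made for illustration. FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(parser, soup.b.string)
```

Passing the parser name explicitly also keeps results consistent across machines, since different parsers can build slightly different trees from the same broken markup.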

4.3.2 Using BeautifulSoup

With BeautifulSoup installed, we can now look at how to use it.

1.Quick start

First import the bs4 library: from bs4 import BeautifulSoup. Then create a string containing the HTML code to parse:

  html_str = """
  <html><head><title>The Dormouse's story</title></head>
  <body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
  <a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>
  <p class="story">...</p>
  """

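All of the parsing and extraction examples that follow use this string.

Next, create a BeautifulSoup object. There are two ways to do this. The first is to create it directly from a string:

  soup = BeautifulSoup(html_str, 'lxml', from_encoding='utf-8')

The second is to create it from a file. Assuming html_str has been saved as index.html:

  soup = BeautifulSoup(open('index.html'))

The document is converted to Unicode, and HTML entities are converted to Unicode characters. Print the contents of the soup object as formatted output: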

  print soup.prettify()

The output is as follows:

  <html>
   <head>
    <title>
     The Dormouse's story
    </title>
   </head>
   <body>
    <p class="title">
     <b>
      The Dormouse's story
     </b>
    </p>
    <p class="story">
     Once upon a time there were three little sisters; and their names were
     <a class="sister" href="http://example.com/elsie" id="link1">
      <!-- Elsie -->
     </a>
     ,
     <a class="sister" href="http://example.com/lacie" id="link2">
      <!-- Lacie -->
     </a>
     and
     <a class="sister" href="http://example.com/tillie" id="link3">
      Tillie
     </a>
     ;
     and they lived at the bottom of a well.
    </p>
    <p class="story">
     ...
    </p>
   </body>
  </html>

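When no parser is specified, Beautiful Soup picks the most suitable parser available to parse the document. If a parser is specified explicitly, Beautiful Soup uses that parser instead; see Table 4-9 for usage.

2.Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds:

·Tag

·NavigableString

·BeautifulSoup

·Comment

1) Tag

A Tag object corresponds to a tag in the original XML or HTML document. For example, in <title>The Dormouse's story</title> or <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, the title and a tags together with their contents are Tag objects. How do we extract a Tag from html_str? For example: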

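·Extract title: print soup.title

·Extract a: print soup.a

·Extract p: print soup.p

As these examples show, soup followed by a tag name returns that tag and its contents, which is much simpler than the regular expressions covered earlier. This approach only returns the first matching tag in the document, though. Finding all matching tags is covered later, in the part on searching the document tree.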

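A Tag has two particularly important properties: name and attributes. Every Tag has a name, available through .name:

  print soup.name
  print soup.title.name

Output:

  [document]
  title

The soup object itself is a special case: its name is [document]. For any other tag, .name is the name of the tag itself.

A Tag's name can be changed as well as read. The change affects all HTML markup generated from the current Beautiful Soup object: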

  soup.title.name = 'mytitle'
  print soup.title
  print soup.mytitle

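Output:

  None
  <mytitle>The Dormouse's story</mytitle>

The title tag has now been renamed to mytitle.

Now for attributes. The tag <p class="title"><b>The Dormouse's story</b></p> has a class attribute whose value is title. A Tag's attributes are accessed the same way as a dictionary: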

  print soup.p['class']
  print soup.p.get('class')

Output:

  ['title']
  ['title']

All of a Tag's attributes can also be read at once through the .attrs property:

  print soup.p.attrs

Output:

  {'class': ['title']}

As with name, a tag's attributes and contents can be modified:

  soup.p['class']="myClass"
  print soup.p

Output:

  <p class="myClass"><b>The Dormouse's story</b></p>
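
The dictionary-style access described above can be sketched as follows in Python 3 syntax. The markup is made up for the example and is parsed with the standard-library html.parser so that nothing extra needs to be installed. It also shows why class comes back as a list: class is a multi-valued attribute, while id is not.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" id="intro"><b>Hi</b></p>', "html.parser")
p = soup.p

classes = p["class"]           # multi-valued attribute: a list of class names
ident = p["id"]                # single-valued attribute: a plain string
missing = p.get("href")        # .get() returns None instead of raising KeyError
fallback = p.get("href", "#")  # a default can be supplied, as with dict.get()

print(classes, ident, missing, fallback)
```

Using .get() is the safer choice in a crawler, where a given attribute is often absent on some of the tags being processed.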

2) NavigableString

We can now get a tag and its contents, but how do we get the text inside a tag? Use .string:

  print soup.p.string
  print type(soup.p.string)

Output:

  The Dormouse's story
  <class 'bs4.element.NavigableString'>

Beautiful Soup wraps the strings inside a Tag in the NavigableString class. A NavigableString behaves like a Python Unicode string, and in Python 2 it can be converted to a plain Unicode string with unicode():

  unicode_string = unicode(soup.p.string)

3) BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a special Tag. Because it does not correspond to a real HTML or XML tag, it has no name or attributes of its own. To keep its interface consistent with Tag, however, it still exposes .name and .attrs:

  print type(soup.name)
  print soup.name
  print soup.attrs

Output:

  <type 'unicode'>
  [document]
  {}

4) Comment

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one most likely to cause trouble is the comment:

  print soup.a.string
  print type(soup.a.string)

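Output:

  Elsie
  <class 'bs4.element.Comment'>

The content of the a tag is actually a comment, yet .string returns it with the comment delimiters stripped. Printing its type shows that it is a Comment. If we do not know what kind of object a tag's .string holds, comments can end up mixed into the extracted data. It is therefore worth checking the type when extracting strings:

  from bs4.element import Comment
  if isinstance(soup.a.string, Comment):
     print soup.a.string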
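
In practice the check is usually the other way around: skip the comment and keep only real text. A minimal sketch in Python 3 syntax follows. The helper name visible_string and the two-link markup are assumptions made for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def visible_string(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a>', "html.parser"
)
print(visible_string(soup.find(id="link1")))  # the comment is skipped
print(visible_string(soup.find(id="link3")))
```

isinstance is used rather than comparing type() directly because Comment is a subclass of NavigableString, so the check stays correct for further subclasses as well.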

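3.Traversing the document tree

BeautifulSoup turns HTML into a document tree that can be searched. Since it is a tree, the notion of a node is essential.

1) Child nodes

Start with direct children, for which a Tag's .contents and .children properties are the important ones. The .contents property returns a Tag's children as a list: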

  print soup.head.contents

Output:

  [<title>The Dormouse's story</title>]

Since the result is a list, we can take its length and index into it:

  print len(soup.head.contents)
  print soup.head.contents[0].string

Output:

  1
  The Dormouse's story

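Note that strings have no .contents property, because a string has no children.

The .children property returns an iterator over a Tag's children:

  for child in soup.head.children:
     print(child)

Output:

  <title>The Dormouse's story</title>

.contents and .children include only a Tag's direct children. For example, the <head> tag has a single direct child, <title>. But <title> itself has a child, the string "The Dormouse's story", and that string is also a descendant of <head>. The .descendants property iterates recursively over all of a tag's descendants: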

  for child in soup.head.descendants:
     print(child)

Output:

  <title>The Dormouse's story</title>
  The Dormouse's story
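
The difference between direct children and descendants can be checked with a small sketch in Python 3 syntax. It uses only the <head> fragment of the sample document and the standard-library html.parser, both chosen here for brevity.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = soup.head

direct = list(head.children)         # only the <title> tag
everything = list(head.descendants)  # the <title> tag and the string inside it

print(len(direct), len(everything))
```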

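That covers getting child nodes. Getting a node's content involves three properties: .string, .strings, and .stripped_strings.

.string behaves as follows. If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the text inside that innermost tag. If a tag has several children, it is unclear which child's text .string should refer to, so the result is None. For example: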

  print soup.head.string
  print soup.title.string
  print soup.html.string

Output:

  The Dormouse's story
  The Dormouse's story
  None

.strings is meant for tags that contain more than one string, and it can be iterated over:

  for string in soup.strings:
     print(repr(string))

Output:

  u"The Dormouse's story"
  u'\n'
  u'\n'
  u"The Dormouse's story"
  u'\n'
  u'Once upon a time there were three little sisters; and their names were\n'
  u',\n'
  u' and\n'
  u'Tillie'
  u';\nand they lived at the bottom of a well.'
  u'\n'
  u'...'
  u'\n'

.stripped_strings does the same but strips leading and trailing whitespace from each string and skips strings that consist only of whitespace:

  for string in soup.stripped_strings:
     print(repr(string))

Output:

  u"The Dormouse's story"
  u"The Dormouse's story"
  u'Once upon a time there were three little sisters; and their names were'
  u','
  u'and'
  u'Tillie'
  u';\nand they lived at the bottom of a well.'
  u'...'
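
The three properties can be compared side by side on one tag. The following is a minimal sketch in Python 3 syntax. The <div> markup is made up for the example, and joining the cleaned strings with a space is just one possible way to build the text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>\n <p> Hello </p>\n <p>World</p>\n</div>", "html.parser")
div = soup.div

whole = div.string                  # None: <div> has several children
raw = list(div.strings)             # includes the whitespace-only strings
clean = list(div.stripped_strings)  # whitespace stripped, blank strings skipped
text = " ".join(clean)

print(repr(whole), len(raw), clean, text)
```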

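2) Parent nodes

Continuing with the tree: every Tag and every string has a parent, the Tag that contains it.

The .parent property returns an element's parent. In html_str, the <head> tag is the parent of the <title> tag: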

  print soup.title
  print soup.title.parent

Output:

  <title>The Dormouse's story</title>
  <head><title>The Dormouse's story</title></head>

The .parents property iterates over all of an element's ancestors. The following example uses .parents to walk from the <a> tag up to the root:

  print soup.a
  for parent in soup.a.parents:
     if parent is None:
       print(parent)
     else:
       print(parent.name)

Output:

  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
  p
  body
  html
  [document]
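
A common use of .parents is to build the path from the root down to a tag, for example when logging where a match was found. The sketch below is in Python 3 syntax. The helper name ancestor_path and the markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def ancestor_path(tag):
    """Return the names from the document root down to the tag itself."""
    names = [parent.name for parent in tag.parents]
    names.reverse()  # .parents yields the nearest ancestor first
    return names + [tag.name]


soup = BeautifulSoup(
    "<html><body><p><a href='#'>x</a></p></body></html>", "html.parser"
)
path = ancestor_path(soup.a)
print(" > ".join(path))
```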

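3) Sibling nodes

The output of soup.prettify() shows that <a> has several siblings. Siblings are nodes at the same level as the current node. .next_sibling returns the next sibling of a node and .previous_sibling returns the previous one. If there is no such node, the result is None. For example: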

  print repr(soup.p.next_sibling)
  print repr(soup.p.previous_sibling)
  print soup.p.next_sibling.next_sibling

Output:

  u'\n'
  u'\n'
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
  <a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a> and
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>

The first two results are newline strings rather than tags, because whitespace and line breaks between tags are nodes too.

The .next_siblings and .previous_siblings properties iterate over a node's following and preceding siblings:

  for sibling in soup.a.next_siblings:
     print(repr(sibling))

Output:

  u',\n'
  <a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a>
  u' and\n'
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
  u';\nand they lived at the bottom of a well.'
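
Because strings count as siblings, code that wants only the neighbouring tags has to filter them out. A minimal sketch in Python 3 syntax follows, with shortened markup made up for the example. It shows two ways to do this: filtering .next_siblings by type, and calling find_next_siblings() with a tag name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p><a id="link1">A</a>,\n<a id="link2">B</a> and\n<a id="link3">C</a>;</p>',
    "html.parser",
)
first = soup.find(id="link1")

nearest = first.next_sibling  # the string ',\n', not a tag
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
same = first.find_next_siblings("a")  # searches the following siblings by name

print(repr(nearest), [t["id"] for t in tag_siblings])
```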

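4) Next and previous elements

These use the .next_element and .previous_element properties. Unlike .next_sibling and .previous_sibling, they are not limited to siblings: they move through all nodes in document order, regardless of level. For example, in <head><title>The Dormouse's story</title></head> the next element after <head> is <title>: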

  print soup.head
  print soup.head.next_element

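Output:

  <head><title>The Dormouse's story</title></head>
  <title>The Dormouse's story</title>

To walk through all preceding or following elements, use the .previous_elements and .next_elements iterators, which move backward or forward through the document as if it were being parsed: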

  for element in soup.a.next_elements:
     print(repr(element))

Output:

  u' Elsie '
  u',\n'
  <a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a>
  u' Lacie '
  u' and\n'
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
  u'Tillie'
  u';\nand they lived at the bottom of a well.'
  u'\n'
  <p class="story">...</p>
  u'...'
  u'\n'
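
The contrast between sibling order and document order is easiest to see on a document without whitespace between tags. The sketch below is in Python 3 syntax, and the compact markup is an assumption chosen for that reason.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><p>x</p></body></html>",
    "html.parser",
)
head = soup.head

sibling = head.next_sibling      # <body>: the next node at the same level
element = head.next_element      # <title>: the next node in document order
inner = soup.title.next_element  # the string 'T' inside <title>

print(sibling.name, element.name, repr(inner))
```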

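That covers traversing the document tree. Next comes the core topic: searching the document tree.

4.Searching the document tree

Beautiful Soup defines many search methods. This section concentrates on find_all(); the other methods take similar arguments and are used in much the same way.

find_all() searches the Tag descendants of the current Tag and checks whether each one matches the filter. Its signature is: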

  find_all( name , attrs , recursive , text , limit , **kwargs )

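The parameters are discussed below, in a different order from the signature to make the examples easier to follow.

1) The name parameter

The name parameter finds all tags whose name matches name; string objects are ignored automatically. Its value can be a string, a regular expression, a list, True, or a function.

The simplest filter is a string. When a string is passed to a search method, Beautiful Soup looks for tags whose name matches it exactly. The following example finds all <b> tags in the document and returns them as a list: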

  print soup.find_all('b')

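Output:

  [<b>The Dormouse's story</b>]

When a regular expression is passed, Beautiful Soup matches tag names with its search() method. The following example finds all tags whose names start with b, so both <body> and <b> are found: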

  import re
  for tag in soup.find_all(re.compile("^b")):
     print(tag.name)

Output:

  body
  b

When a list is passed, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> and <b> tags in the document:

  print soup.find_all(["a", "b"])

Output:

  [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True matches any value. The following code finds every tag in the document but returns no string nodes:

  for tag in soup.find_all(True):
     print(tag.name)

Output:

  html
  head
  title
  body
  p
  b
  p
  a
  a
  a
  p

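If none of these filters fits, you can define a function. The function takes a single argument, a Tag, and returns True if the tag matches and False otherwise. For example, the following filter selects elements that have both a class attribute and an id attribute: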

  def hasClass_Id(tag):
     return tag.has_attr('class') and tag.has_attr('id')
  print soup.find_all(hasClass_Id)

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
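
A function filter is most useful for conditions the other filter types cannot express, such as the absence of an attribute. A minimal sketch in Python 3 syntax follows. The function name link_without_id and the markup are hypothetical, chosen only for this example.

```python
from bs4 import BeautifulSoup


def link_without_id(tag):
    """Match <a> tags that have an href attribute but no id attribute."""
    return tag.name == "a" and tag.has_attr("href") and not tag.has_attr("id")


soup = BeautifulSoup(
    '<a href="/x" id="a1">one</a><a href="/y">two</a><b>three</b>',
    "html.parser",
)
matches = soup.find_all(link_without_id)
print(matches)
```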

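2) The kwargs parameters

kwargs stands for Python keyword arguments. If a keyword argument's name is not one of the built-in parameter names of the search method, it is treated as a filter on the tag attribute of that name. The value of such a filter can be a string, a regular expression, a list, or True.

If an id argument is passed, Beautiful Soup filters on each tag's id attribute: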

  print soup.find_all(id='link2')

Output:

  [<a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a>]

If an href argument is passed, Beautiful Soup filters on each Tag's href attribute. For example, to find tags whose href contains "elsie":

  import re
  print soup.find_all(href=re.compile("elsie"))

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

The following code finds every Tag that has an id attribute, whatever its value:

  print soup.find_all(id=True)

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

To filter by class, append an underscore to the parameter name, because class is a reserved word in Python:

  print soup.find_all("a", class_="sister")

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!-- Lacie --></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Several keyword arguments can be combined to filter on several attributes at once:

  print soup.find_all(href=re.compile("elsie"), id='link1')

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Some attributes cannot be used as keyword arguments, such as the data-* attributes in HTML5:

  data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
  data_soup.find_all(data-foo="value")

This code is not valid Python. Such attributes can still be searched by passing a dictionary through the attrs parameter of find_all():

  data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
  data_soup.find_all(attrs={"data-foo": "value"})

Output:

  [<div data-foo="value">foo!</div>]
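
The attrs dictionary also covers another awkward case: an HTML attribute that is itself called name, which cannot be passed as a keyword because find_all() already uses name for the tag name. The following sketch is in Python 3 syntax, and the markup is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div data-foo="value">foo!</div>'
    '<div data-foo="other">bar</div>'
    '<input name="email">',
    "html.parser",
)

by_data = soup.find_all(attrs={"data-foo": "value"})     # exact attribute value
by_name_attr = soup.find_all(attrs={"name": "email"})    # the HTML name attribute
combined = soup.find_all("div", attrs={"data-foo": True})  # any value, <div> only

print(by_data, by_name_attr, combined)
```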

3) The text parameter

The text parameter searches the string content of the document. Like name, text accepts a string, a regular expression, a list, or True. For example:

  print soup.find_all(text="Elsie")
  print soup.find_all(text=["Tillie", "Elsie", "Lacie"])
  print soup.find_all(text=re.compile("Dormouse"))

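Output:

  [u'Elsie']
  [u'Elsie', u'Lacie', u'Tillie']
  [u"The Dormouse's story", u"The Dormouse's story"]

Although text is meant for searching strings, it can be combined with other parameters to filter tags. Beautiful Soup then returns tags whose .string matches the text value. The following code finds the <a> tags whose content is "Elsie":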

  print soup.find_all("a", text="Elsie")

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>]
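
In Beautiful Soup 4.4.0 and later the same parameter is also available under the name string. A minimal sketch of both uses in Python 3 syntax follows. The markup is a shortened stand-in for the sample document, with plain text instead of comments inside the links.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><b>The Dormouse's story</b></p><a>Tillie</a><a>Lacie</a>",
    "html.parser",
)

# Used alone, the filter returns the matching strings themselves.
strings = soup.find_all(string=re.compile("Dormouse"))
# Combined with a tag name, it returns tags whose .string matches.
tags = soup.find_all("a", string="Tillie")

print(strings, tags)
```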

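4) The limit parameter

find_all() returns every result it finds, which can be slow on a large document tree. If you do not need all results, use the limit parameter to cap the number returned. It works like the LIMIT keyword in SQL: the search stops as soon as the number of results reaches the limit. In the following example three tags in the document match, but only two are returned because of the limit: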

  print soup.find_all("a", limit=2)

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>]

5) The recursive parameter

By default a tag's find_all() examines all of the tag's descendants. To search only its direct children, pass recursive=False. For example:

  print soup.find_all("title")
  print soup.find_all("title", recursive=False)

Output:

  [<title>The Dormouse's story</title>]
  []
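
The limit and recursive parameters, together with find(), the single-result counterpart of find_all(), can be sketched as follows in Python 3 syntax. The compact markup is an assumption made for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head>"
    "<body><a>1</a><a>2</a><a>3</a></body></html>",
    "html.parser",
)

first_two = soup.find_all("a", limit=2)
# <title> is not a direct child of the document, so nothing is found here.
top_level = soup.find_all("title", recursive=False)
# It is a direct child of <head>, so searching from there succeeds.
in_head = soup.head.find_all("title", recursive=False)

first = soup.find("a")        # the first match, like find_all("a", limit=1)[0]
nothing = soup.find("table")  # None when nothing matches

print(first_two, top_level, in_head, first, nothing)
```

Since find() returns None rather than an empty list, its result should be checked before any attribute is accessed on it.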

That covers most of the parameters of find_all(). The other search methods are used in a similar way and are listed in Table 4-10.

Table 4-10 Search functions

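5.CSS selectors

The earlier chapter on Web front-end basics covered CSS syntax, and CSS can also be used to locate elements. In CSS a tag name is written as is, a class name is prefixed with a dot ".", and an id is prefixed with "#". The same notation can be used here to select elements through soup.select(), which returns a list.

1) Selecting by tag name

Tag names can be used to select tags directly or level by level, and also to select the direct children or siblings of a tag. For example:

  # Select the title tag directly
  print soup.select("title")
  # Select the title tag level by level
  print soup.select("html head title")
  # Select direct children:
  # the title tag directly under head
  print soup.select("head > title")
  # the tag with id="link1" directly under p
  print soup.select("p > #link1")
  # Select siblings:
  # all later siblings of id="link1" that have class="sister"
  print soup.select("#link1 ~ .sister")
  # the sibling immediately after id="link1", if it has class="sister"
  print soup.select("#link1 + .sister")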

Output:

  [<title>The Dormouse's story</title>]
  [<title>The Dormouse's story</title>]
  [<title>The Dormouse's story</title>]
  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>]
  [<a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  [<a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>]

2) Selecting by CSS class

For example:

  print soup.select(".sister")
  print soup.select("[class~=sister]")

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3) Selecting by id

For example:

  print soup.select("#link1")
  print soup.select("a#link2")

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>]
  [<a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>]

4) Selecting by the presence of an attribute

For example:

  print soup.select('a[href]')

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

5) Selecting by attribute value

For example:

  print soup.select('a[href="http://example.com/elsie"]')
  print soup.select('a[href^="http://example.com/"]')
  print soup.select('a[href$="tillie"]')
  print soup.select('a[href*=".com/el"]')

Output:

  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>]
  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>, <a class="sister" href="http://example.com/lacie" id="link2"><!--Lacie--></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  [<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>]
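
In a crawler the selected tags are usually only an intermediate step toward attribute values. The sketch below, in Python 3 syntax, combines a class selector with the child combinator to collect href values, and uses select_one() when a single element is expected. The markup is a shortened stand-in for the sample document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p>",
    "html.parser",
)

# select() returns a list, so attribute values can be collected in one pass.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]
# select_one() returns the first match, or None when nothing matches.
second = soup.select_one("#link2")
missing = soup.select_one("#nope")

print(hrefs, second.string, missing)
```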

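Those are the ways to select elements with CSS selectors. If you are not familiar with selector syntax, the CSS selector reference on W3CSchool is a good place to learn it. The FirePath feature of Firebug can also generate the CSS selector expression for a page element automatically, as shown in Figure 4-25.

Figure 4-25 FirePath CSS selector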

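4.3.3 Parsing with lxml and XPath

BeautifulSoup can use lxml as its parser, and lxml can also be used on its own. The two compare as follows:

·BeautifulSoup and lxml work on different principles. BeautifulSoup is DOM based: it loads the whole document and parses the entire DOM tree, so its time and memory overhead are much higher. lxml is a library that queries and processes HTML/XML documents with XPath and only traverses part of the tree, so it is faster. Fortunately, BeautifulSoup can now use lxml as its parser.

·BeautifulSoup is easy to use. Its API is friendly and it supports CSS selectors, which makes it suitable for beginners. XPath expressions for lxml are more tedious to write, so development is slower than with BeautifulSoup. This varies from person to person, of course: if you are fluent in XPath, lxml is the better choice, especially now that tools such as FirePath can generate XPath expressions automatically.

Chapter 2 already covered XPath, so we go straight to parsing a web page with lxml. For example: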

  from lxml import etree
  html_str = """
  <html><head><title>The Dormouse's story</title></head>
  <body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>
  <p class="story">...</p>
  """
  html = etree.HTML(html_str)
  result = etree.tostring(html)
  print(result)

Output:

  <html><head><title>The Dormouse's story</title></head>
  <body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>
  <p class="story">...</p>
  </body></html>

Note that html_str has no closing </body> or </html> tags. The output shows a very useful feature of lxml: it repairs the HTML automatically.

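Besides strings, lxml can read HTML files directly. Assuming html_str has been saved as index.html, parse it with the parse method. An HTML parser is passed explicitly because the default XML parser rejects markup with unclosed tags:

  from lxml import etree
  html = etree.parse('index.html', etree.HTMLParser())
  result = etree.tostring(html, pretty_print=True)
  print(result)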

Next, use XPath to extract all the URLs:

  html = etree.HTML(html_str)
  urls = html.xpath(".//*[@class='sister']/@href")
  print urls

Output:

  ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
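
XPath can return elements as well as attribute values, which is handy when both the link text and the URL are needed. A minimal sketch in Python 3 syntax follows. It assumes lxml is installed and uses a shortened version of the sample markup.

```python
from lxml import etree

html_str = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p>"
)
root = etree.HTML(html_str)

# Select the elements, then read the text and the href of each one.
links = [(a.text, a.get("href")) for a in root.xpath("//a[@class='sister']")]
# text() selects the text nodes directly and returns a list of strings.
lacie = root.xpath("//a[@id='link2']/text()")

print(links, lacie)
```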

The key to using lxml is constructing XPath expressions. If you are not familiar with XPath, review the XPath material in Chapter 2.
