18.4 Selectors

Published 2024-01-26 22:39:51

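PySpider ships with PyQuery for parsing web page data. PyQuery is a Python library that closely mirrors jQuery, and its syntax is almost identical to jQuery's, so readers with a front-end background can pick it up quickly. The basics of PyQuery are covered below.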

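18.4.1 Using PyQuery

PySpider already bundles the PyQuery library, so there is nothing extra to install. The four topics below cover the basics.

1. Initializing a PyQuery object

· Initialize from an HTML string: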

  from pyquery import PyQuery as pq
  d = pq("<html></html>")

· Initialize from an lxml element. Parsing with lxml first normalizes the markup into a clean, complete document:

  from pyquery import PyQuery as pq
  from lxml import etree
  d = pq(etree.fromstring("<html></html>"))

· Initialize from a URL, which is equivalent to fetching the page directly:

  from pyquery import PyQuery as pq
  d = pq('http://www.google.com')

· Initialize from the path of an HTML file:

  from pyquery import PyQuery as pq
  d = pq(filename='index.html')

2. Attribute operations

Attribute operations in PyQuery follow jQuery syntax. For example:

  from pyquery import PyQuery as pq
  p = pq('<p id="hello" class="hello"></p>')('p')
  print(p.attr("id"))
  print(p.attr["id"])
  print(p.attr("id", "plop"))
  print(p.attr("id", "hello"))
  print(p.attr(id='hello', class_='hello2'))
  p.attr.class_ = 'world'
  p.addClass("!!!")
  print(p)
  print(p.css("font-size", "15px"))
  print(p.attr.style)

Output:

  hello
  hello
  <p id="plop" class="hello"/>
  <p id="hello" class="hello"/>
  <p id="hello" class="hello2"/>
  <p id="hello" class="world !!!"/>
  <p id="hello" class="world !!!" style="font-size: 15px"/>
  font-size: 15px

PyQuery can modify attribute and style values as well as read them.

3. DOM operations

DOM manipulation also works the same way as in jQuery. For example:

  from pyquery import PyQuery as pq
  d = pq('<p class="hello" id="hello">you know Python rocks</p>')
  d('p').append(' check out <a href="http://reddit.com/r/python"><span>reddit</span></a>')
  print(d)
  p = d('p')
  p.prepend('check out <a href="http://reddit.com/r/python">reddit</a>')
  print(p)

Output:

  <p class="hello" id="hello">you know Python rocks check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
  <p class="hello" id="hello">check out <a href="http://reddit.com/r/python">reddit</a>you know Python rocks check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>

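4. Iterating over elements

Web data extraction usually means pulling out many items of the same kind, which requires iterating over the matched elements. For example: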

  from pyquery import PyQuery as pq
  html_cont = '''
  <div>
     <ul>
        <li class="one">first item</li>
        <li class="two"><a href="link2.html">second</a></li>
        <li class="four"><a href="link3.html">third</a></li>
        <li class="three"><a href="link4.html"><span class="bold">fourth</span></a></li>
     </ul>
  </div>
  '''
  doc = pq(html_cont)
  lis = doc('li')
  # items() yields each matched <li> as its own PyQuery object
  for li in lis.items():
      print(li.html())

Output:

  first item
  <a href="link2.html">second</a>
  <a href="link3.html">third</a>
  <a href="link4.html"><span class="bold">fourth</span></a>

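That covers the basics of PyQuery. If you are not familiar with jQuery syntax, either learn jQuery first and then come back to PyQuery, or parse the response with another third-party package such as lxml or bs4.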

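18.4.2 Parsing the data

With PyQuery covered, we return to the doubanMovie project. We need to extract the URLs of the movie list pages from the page shown in Figure 18-7. The CSS selector for these URLs can be found with Firebug, or with PySpider's enable css selector helper tool (which does not always work well). The index_page code is as follows: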

  @config(age=10 * 24 * 60 * 60)
  def index_page(self, response):
      for each in response.doc('.tagCol>tbody>tr>td>a').items():
          self.crawl(each.attr.href, callback=self.list_page)

After index_page runs, it generates new requests, as shown in Figure 18-10.

Figure 18-10 Extraction results
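
The age argument passed to @config on index_page is expressed in seconds. In PySpider it is the validity period of a task: within that period the page is treated as unchanged and is not fetched again. The sketch below only checks the arithmetic behind 10 * 24 * 60 * 60; the helper name days_to_seconds is hypothetical and not part of PySpider.

```python
from datetime import timedelta


def days_to_seconds(days):
    """Convert a number of days to whole seconds, the unit age expects."""
    return int(timedelta(days=days).total_seconds())


AGE_TEN_DAYS = days_to_seconds(10)

# The literal used in the decorator is the same value
print(AGE_TEN_DAYS == 10 * 24 * 60 * 60)
```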

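Next, click the arrow after each request to send it. Once the response arrives, switch to the web tab and extract the movie URLs and the pagination link from the page shown in Figure 18-8.

· CSS selector for the movie URLs: .pl2>a

· CSS selector for the pagination link: .next>a

The list_page code is as follows: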

  def list_page(self, response):
      # Extract each movie URL and let detail_page parse the movie details
      for each in response.doc('.pl2>a').items():
          self.crawl(each.attr.href, callback=self.detail_page)
      # Follow the pagination link back into list_page
      for each in response.doc('.next>a').items():
          self.crawl(each.attr.href, callback=self.list_page)

Save the code and click Run to see the extracted movie URLs, as shown in Figure 18-11.
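
Because list_page schedules itself for the pagination link, the crawl walks the chain of list pages until .next>a no longer matches anything. The following framework-free sketch imitates that control flow on an in-memory stand-in for the site. PAGES, crawl_lists and the URLs are hypothetical; in a real project PySpider's scheduler does the queueing.

```python
from collections import deque

# Hypothetical site: each list page holds movie links and an optional next link
PAGES = {
    "list/1": {"movies": ["movie/a", "movie/b"], "next": "list/2"},
    "list/2": {"movies": ["movie/c"], "next": None},
}


def crawl_lists(start_url, pages):
    """Follow next links from start_url and collect every movie URL once."""
    queue = deque([start_url])
    seen = {start_url}
    movie_urls = []
    while queue:
        page = pages[queue.popleft()]
        # Equivalent of the '.pl2>a' loop: hand each movie URL on
        for movie in page["movies"]:
            if movie not in movie_urls:
                movie_urls.append(movie)
        # Equivalent of the '.next>a' loop: queue the next list page, if any
        next_url = page["next"]
        if next_url and next_url not in seen:
            seen.add(next_url)
            queue.append(next_url)
    return movie_urls


print(crawl_lists("list/1", PAGES))
```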

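Repeat the same steps: click the arrow at the end of each row to send the request, and once the response arrives, switch to the web tab to analyze and extract the movie details from the page shown in Figure 18-9. The CSS selectors for the detail page are: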

· Movie title: #content>h1>span[property="v:itemreviewed"]

· Release year: #content>h1>span[class="year"]

· Director: .attrs>a[rel="v:directedBy"]

· Leading actors: .attrs>span>a[rel="v:starring"]

· Genre: #info>span[property="v:genre"]

· Rating: .ll.rating_num
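
Selectors copied from typeset text sometimes carry curly quotes or a full-width colon, which a CSS engine will not match. The sketch below maps such characters back to ASCII and keeps the detail-page selectors in one dictionary. The names normalize_selector and DETAIL_SELECTORS and the dictionary keys are hypothetical; the selector strings are the ones listed above.

```python
# Typographic punctuation that sometimes replaces ASCII in typeset selectors
_PUNCTUATION = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\uff1a": ":",  # full-width colon
})


def normalize_selector(selector):
    """Return the selector with typographic punctuation replaced by ASCII."""
    return selector.translate(_PUNCTUATION)


# Detail-page selectors keyed by hypothetical field names
DETAIL_SELECTORS = {
    "title": '#content>h1>span[property="v:itemreviewed"]',
    "year": '#content>h1>span[class="year"]',
    "director": '.attrs>a[rel="v:directedBy"]',
    "actors": '.attrs>span>a[rel="v:starring"]',
    "genres": '#info>span[property="v:genre"]',
    "rating": '.ll.rating_num',
}

print(normalize_selector("#info>span[property=\u201cv\uff1agenre\u201d]"))
```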

Figure 18-11 Movie URLs

The detail_page method is as follows:

  def detail_page(self, response):
      title = response.doc('#content>h1>span[property="v:itemreviewed"]').text()
      time = response.doc('#content>h1>span[class="year"]').text()
      director = response.doc('.attrs>a[rel="v:directedBy"]').text()
      actor = []
      genre = []
      for each in response.doc('a[rel="v:starring"]').items():
          actor.append(each.text())
      for each in response.doc('#info>span[property="v:genre"]').items():
          genre.append(each.text())
      # The source listing is cut off here; the genre loop above is completed
      # from the selector list and mirrors the actor loop (an assumption).

Save the code and click Run to see the extracted movie details, as shown in Figure 18-12.
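
The detail_page listing above stops before the handler returns anything. In PySpider, whatever a callback returns is stored as the result of the task, so the method would plausibly end by returning the extracted fields. The sketch below shows one way to bundle them, kept free of the framework so it can be run on its own. The helper name build_movie_record, the dictionary keys, and the cleanup rules are assumptions, not the author's code.

```python
def build_movie_record(url, title, year, director, actors, genres, rating):
    """Bundle scraped fields into one dict that a callback could return."""
    return {
        "url": url,
        "title": title.strip(),
        # Assumption: the year text may be wrapped in parentheses, e.g. "(1994)"
        "year": year.strip("() "),
        "director": director.strip(),
        "actors": list(actors),
        "genres": list(genres),
        # Assumption: an empty rating string means the page showed no score
        "rating": float(rating) if rating else None,
    }


# Made-up values standing in for the text extracted by the selectors
record = build_movie_record(
    "http://example.com/movie/1",
    " Some Movie ",
    "(1994)",
    "A Director",
    ["Actor One", "Actor Two"],
    ["Drama"],
    "9.7",
)
print(record["title"], record["year"], record["rating"])
```

Inside detail_page this could be used as return build_movie_record(response.url, title, time, director, actor, genre, rating), where rating would come from response.doc('.ll.rating_num').text().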