9.2 Dynamic Crawler 1: Crawling Movie Review Information

Published on 2024-01-26 22:39:51

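This section uses the MTime movie site (www.mtime.com) as its example. The first step is to determine whether the site loads its content dynamically. Open one movie page, http://movie.mtime.com/217130/ , in Firefox, then open Firebug and watch the network traffic, as shown in Figure 9-1.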

Figure 9-1 The MTime movie site

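Searching the page's network response for the text "票房" (box office) finds nothing, yet the rendered page clearly shows a box-office figure. That is a strong sign that the figure is loaded dynamically. The next task is to find which JavaScript request loads it. Select the JavaScript category in Firebug's Net panel and look through the requests for telltale strings such as "Ajax". As Figure 9-2 shows, the rating and box-office information turns up in the response to the link http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F217130%2F&t=2016111321341844484&Ajax_CallBackArgument0=217130 .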

Figure 9-2 The Ajax link
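The manual check described above (search the raw response for text that is visible in the rendered page) can also be scripted. The following is a minimal sketch rather than part of the crawler itself; the function name looks_dynamic and the sample HTML are made up for illustration.

```python
def looks_dynamic(raw_html, visible_text):
    """Return True if text seen in the browser is absent from the raw HTML.

    A True result suggests (but does not prove) that the text is filled in
    later by JavaScript, for example through an Ajax request.
    """
    return visible_text not in raw_html


# Hypothetical static HTML: the box-office block is an empty placeholder.
sample_html = '<div id="boxOffice"></div><h1>Movie title</h1>'

print(looks_dynamic(sample_html, '票房'))         # text missing from raw HTML
print(looks_dynamic(sample_html, 'Movie title'))  # text served statically
```

A positive result only narrows the search; the request that actually carries the data still has to be found in the network panel.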

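With the link and its response located, two things remain. The first is to work out how to construct such a link and what its parameters mean. The second is to work out how to extract the information we need from the response.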

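The GET request http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F217130%2F&t=2016111321341844484&Ajax_CallBackArgument0=217130 carries seven parameters. Which of them change from movie to movie, and which stay fixed? The most effective way to find out is to capture the same request for a different movie and compare the two.

For example, visiting http://movie.mtime.com/108737/ loads the box office through this link: http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F108737%2F&t=201611132231493282&Ajax_CallBackArgument0=108737 .

The comparison shows that only three parameters change: Ajax_RequestUrl, t and Ajax_CallBackArgument0. Looking closer, Ajax_RequestUrl is the URL-encoded link of the current page, Ajax_CallBackArgument0 is the number in the page link (108737 in http://movie.mtime.com/108737/ ), and t is the current time written as year, month, day, hour, minute and second, followed by a few extra digits. With this information we can construct a link that returns the box office and rating of any movie.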
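The comparison and the construction of the link can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not the crawler's own code: the helper names varying_params and build_rating_url are hypothetical, the query string is assumed to start with "?" after Movie.api, and the fixed suffix 3282 appended to t is simply copied from one observed request.

```python
import time
from urllib.parse import parse_qs, urlencode, urlsplit

API = 'http://service.library.mtime.com/Movie.api'
FIXED = ('Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services'
         '&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1')
# The two observed links, written with "?" before the query string.
URL_A = (API + '?' + FIXED +
         '&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F217130%2F'
         '&t=2016111321341844484&Ajax_CallBackArgument0=217130')
URL_B = (API + '?' + FIXED +
         '&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F108737%2F'
         '&t=201611132231493282&Ajax_CallBackArgument0=108737')


def varying_params(url_a, url_b):
    """Return the sorted names of query parameters whose values differ."""
    qa = parse_qs(urlsplit(url_a).query)
    qb = parse_qs(urlsplit(url_b).query)
    return sorted(k for k in set(qa) | set(qb) if qa.get(k) != qb.get(k))


def build_rating_url(movie_url, movie_id, now=None):
    """Build the Ajax link for one movie page.

    t is the local time as YYYYMMDDHHMMSS plus the assumed suffix '3282'.
    """
    if now is None:
        now = time.localtime()
    t = time.strftime('%Y%m%d%H%M%S', now) + '3282'
    params = [
        ('Ajax_CallBack', 'true'),
        ('Ajax_CallBackType', 'Mtime.Library.Services'),
        ('Ajax_CallBackMethod', 'GetMovieOverviewRating'),
        ('Ajax_CrossDomain', '1'),
        ('Ajax_RequestUrl', movie_url),   # urlencode percent-encodes this
        ('t', t),
        ('Ajax_CallBackArgument0', movie_id),
    ]
    return API + '?' + urlencode(params)


print(varying_params(URL_A, URL_B))
print(build_rating_url('http://movie.mtime.com/108737/', 108737))
```

Using urlencode rather than string formatting keeps Ajax_RequestUrl percent-encoded in the same way as the links captured in the browser.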

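Finally, the content has to be extracted from the response, so first look at its format. Responses come in three main forms: one for movies now showing, one for movies coming soon, and one for movies whose release is still a long way off.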

A movie now showing returns the following:

  var result_201611132231493282 = { "value":{"isRelease":true,"movieRating":
  {"MovieId":108737,"RatingFinal":7.7,"RDirectorFinal":7.7,"ROtherFinal":7,
  "RPictureFinal":8.4,"RShowFinal":10,"RStoryFinal":7.3,"RTotalFinal":10,
  "Usercount":4067,"AttitudeCount":4300,"UserId":0,"EnterTime":0,
  "JustTotal":0,"RatingCount":0,"TitleCn":"","TitleEn":"","Year":"",
  "IP":0},"movieTitle":"奇异博士","tweetId":0,
  "userLastComment":"","userLastCommentUrl":"","releaseType":1,
  "boxOffice":{"Rank":1,"TotalBoxOffice":"5.66","TotalBoxOfficeUnit":"亿",
  "TodayBoxOffice":"4776.8","TodayBoxOfficeUnit":"万","ShowDays":10,
  "EndDate":"2016-11-13 22:00","FirstDayBoxOffice":"8146.21",
  "FirstDayBoxOfficeUnit":"万"}},"error":null};
  var movieOverviewRatingResult=result_201611132231493282;

A movie coming soon returns the following:

  var result_2016111414381839596 ={ "value":{"isRelease":true,"movieRating":
  {"MovieId":229639,"RatingFinal":-1,"RDirectorFinal":0,"ROtherFinal":0,
  "RPictureFinal":0,"RShowFinal":0,"RStoryFinal":0,"RTotalFinal":0,
  "Usercount":130,"AttitudeCount":2119,"UserId":0,"EnterTime":0,
  "JustTotal":0,"RatingCount":0,"TitleCn":"","TitleEn":"","Year":"",
  "IP":0},"movieTitle":"我不是潘金莲","tweetId":0,
  "userLastComment":"","userLastCommentUrl":"","releaseType":2,
  "hotValue":{"MovieId":229639,"Ranking":1,"Changing":0,
  "YesterdayRanking":1}},"error":null};
  var movieOverviewRatingResult=result_2016111414381839596;

A movie whose release is still a long way off returns the following:

  var result_201611141343063282 = { "value":{"isRelease":false,"movieRating":
  {"MovieId":236608,"RatingFinal":-1,"RDirectorFinal":0,"ROtherFinal":0,
  "RPictureFinal":0,"RShowFinal":0,"RStoryFinal":0,"RTotalFinal":0,
  "Usercount":5,"AttitudeCount":19,"UserId":0,"EnterTime":0,
  "JustTotal":0,"RatingCount":0,"TitleCn":"","TitleEn":"","Year":"",
  "IP":0},"movieTitle":"江南灵异录之白云桥","tweetId":0,
  "userLastComment":"","userLastCommentUrl":"","releaseType":2,
  "hotValue":{"MovieId":236608,"Ranking":53,"Changing":4,
  "YesterdayRanking":57}},"error":null};
  var movieOverviewRatingResult=result_201611141343063282;

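The three formats differ only in which fields are present: a movie now showing carries a boxOffice object, while the other two carry a hotValue object instead. The parser therefore needs a few checks and some exception handling for the missing fields.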
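The distinction between the three formats can be expressed as a small function over the decoded response. The sketch below is illustrative only: the name classify is hypothetical, the sample dictionaries are trimmed from the three responses shown above, and the return codes 1, 2 and 0 follow the isRelease values used by the parser later in this section.

```python
def classify(payload):
    """Classify a decoded response by the fields it carries.

    Returns 1 for a movie now showing, 2 for a movie coming soon and
    0 for a movie that is still far from release.
    """
    value = payload.get('value') or {}
    if not value.get('isRelease'):
        return 0
    # A released movie has boxOffice; an upcoming one has hotValue instead.
    return 2 if value.get('hotValue') is not None else 1


# Trimmed versions of the three sample responses.
now_showing = {'value': {'isRelease': True, 'boxOffice': {'Rank': 1}}}
coming_soon = {'value': {'isRelease': True, 'hotValue': {'Ranking': 1}}}
far_off = {'value': {'isRelease': False, 'hotValue': {'Ranking': 53}}}

for sample in (now_showing, coming_soon, far_off):
    print(classify(sample))
```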

The content between the "=" and the ";" is standard JSON. The fields we want to extract are described in Table 9-1.

Table 9-1 Field definitions
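Pulling the JSON out of such a response can be sketched as follows. This is a minimal Python 3 illustration, not the only way to do it: the name extract_payload is hypothetical, and the pattern (from the first "{" after "=" to the last "}" that is followed by ";") is an assumption chosen so that the trailing assignment statement does not leak into the captured text. The sample string is trimmed from the third response shown above.

```python
import json
import re

# From the first "{" after "=" to the last "}" that is followed by ";".
# re.S lets "." cross line breaks in case the response is wrapped.
_PAYLOAD = re.compile(r'=\s*(\{.*\})\s*;', re.S)


def extract_payload(response_text):
    """Return the decoded JSON object, or None if no object is found."""
    match = _PAYLOAD.search(response_text)
    if match is None:
        return None
    return json.loads(match.group(1))


sample = ('var result_201611141343063282 = { "value":{"isRelease":false,'
          '"movieRating":{"MovieId":236608,"RatingFinal":-1},'
          '"movieTitle":"江南灵异录之白云桥"},"error":null};\n'
          'var movieOverviewRatingResult=result_201611141343063282;')

payload = extract_payload(sample)
print(payload['value']['movieTitle'])
print(payload['value']['movieRating']['MovieId'])
```

Anchoring on the braces rather than on the first ";" also tolerates a semicolon inside a string value such as a user comment.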

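With the link and the fields settled, the next step is to write a dynamic crawler that collects each movie's rating and box-office information.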

1. Web page downloader

The downloader is implemented in the same way as in Chapter 6. The code is as follows:

  # coding:utf-8
  import requests


  class HtmlDownloader(object):

      def download(self, url):
          # Nothing to fetch
          if url is None:
              return None
          # Send a browser-like User-Agent header
          user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
          headers = {'User-Agent': user_agent}
          r = requests.get(url, headers=headers)
          # Return the body only when the request succeeded
          if r.status_code == 200:
              r.encoding = 'utf-8'
              return r.text
          return None

2. Web page parser

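The parser has two jobs: extracting the links of all movies now showing from the current page, and extracting the fields we need from the response of the dynamically loaded link.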

A regular expression extracts the links of the movies now showing. Movie page links look like http://movie.mtime.com/17681/ , so the pattern http://movie\.mtime\.com/\d+/ matches them. Define a parser_url method in the HtmlParser class as follows:

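  # Method of HtmlParser; the module needs "import re" at the top.
  def parser_url(self, page_url, response):
      # The download may have failed
      if response is None:
          return []
      # Group 1 is the whole link, group 2 is the numeric movie id
      pattern = re.compile(r'(http://movie\.mtime\.com/(\d+)/)')
      # findall returns a list of (link, id) tuples, never None
      urls = pattern.findall(response)
      # Remove duplicate links
      return list(set(urls))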
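Because the pattern contains two groups, findall returns a (link, id) tuple for every match, which is why the scheduler later reads url[0] and url[1]. The standalone Python 3 sketch below shows that behaviour on a made-up HTML fragment; the function name extract_movie_links is hypothetical, and the two movie links are the ones used earlier in this section.

```python
import re

# Group 1: the whole link. Group 2: the numeric movie id.
MOVIE_URL = re.compile(r'(http://movie\.mtime\.com/(\d+)/)')


def extract_movie_links(html):
    """Return sorted, de-duplicated (link, movie_id) pairs found in html."""
    return sorted(set(MOVIE_URL.findall(html)))


# Hypothetical fragment with one repeated link.
sample = ('<a href="http://movie.mtime.com/217130/">A</a>'
          '<a href="http://movie.mtime.com/108737/">B</a>'
          '<a href="http://movie.mtime.com/217130/">A again</a>')

for link, movie_id in extract_movie_links(sample):
    print(link, movie_id)
```

Sorting is added here only to make the output order predictable; the set alone is enough for de-duplication.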

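Next, extract the fields we need from the response of the dynamically loaded link. First use a regular expression to pull out the content between the "=" and the ";", then hand it to the json module. What remains is to handle the different formats: parser_json is the main method that parses the response, and it relies on two helper methods, _parser_no_release and _parser_release. The code is as follows: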

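  # Methods of HtmlParser; the module needs "import re" and "import json".
  def parser_json(self, page_url, response):
      '''
      Parse the response
      :param page_url: the Ajax link that was requested
      :param response: body of the response
      :return: a tuple of fields, or None on failure
      '''
      if response is None:
          return None
      # Extract the JSON object between "=" and ";": from the first "{"
      # to the last "}" that is followed by ";"
      pattern = re.compile(r'=\s*(\{.*\})\s*;', re.S)
      result = pattern.search(response)
      if result is None:
          return None
      try:
          # Load the string with the json module
          value = json.loads(result.group(1))
          isRelease = value.get('value').get('isRelease')
      except Exception as e:
          print(e)
          return None
      if isRelease:
          # No hotValue: now showing. With hotValue: coming soon.
          if value.get('value').get('hotValue') is None:
              return self._parser_release(page_url, value)
          else:
              return self._parser_no_release(page_url, value, isRelease=2)
      else:
          return self._parser_no_release(page_url, value)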
  
  def _parser_release(self, page_url, value):
      '''
      Parse a movie that is already showing
      :param page_url: movie link
      :param value: decoded JSON data
      :return: a tuple of 14 fields, or None on failure
      '''
      try:
          isRelease = 1
          movieRating = value.get('value').get('movieRating')
          boxOffice = value.get('value').get('boxOffice')
          movieTitle = value.get('value').get('movieTitle')

          RPictureFinal = movieRating.get('RPictureFinal')
          RStoryFinal = movieRating.get('RStoryFinal')
          RDirectorFinal = movieRating.get('RDirectorFinal')
          ROtherFinal = movieRating.get('ROtherFinal')
          RatingFinal = movieRating.get('RatingFinal')

          MovieId = movieRating.get('MovieId')
          Usercount = movieRating.get('Usercount')
          AttitudeCount = movieRating.get('AttitudeCount')

          TotalBoxOffice = boxOffice.get('TotalBoxOffice')
          TotalBoxOfficeUnit = boxOffice.get('TotalBoxOfficeUnit')
          TodayBoxOffice = boxOffice.get('TodayBoxOffice')
          TodayBoxOfficeUnit = boxOffice.get('TodayBoxOfficeUnit')

          ShowDays = boxOffice.get('ShowDays')
          # Fall back to 0 when the rank is missing
          Rank = boxOffice.get('Rank', 0)
          # Return the extracted fields
          return (MovieId, movieTitle, RatingFinal,
                  ROtherFinal, RPictureFinal, RDirectorFinal,
                  RStoryFinal, Usercount, AttitudeCount,
                  TotalBoxOffice + TotalBoxOfficeUnit,
                  TodayBoxOffice + TodayBoxOfficeUnit,
                  Rank, ShowDays, isRelease)
      except Exception as e:
          print('%s %s %s' % (e, page_url, value))
          return None
  
  def _parser_no_release(self, page_url, value, isRelease=0):
      '''
      Parse a movie that has not been released yet
      :param page_url: movie link
      :param value: decoded JSON data
      :param isRelease: 2 for coming soon, 0 for far from release
      :return: a tuple of 14 fields, or None on failure
      '''
      try:
          movieRating = value.get('value').get('movieRating')
          movieTitle = value.get('value').get('movieTitle')

          RPictureFinal = movieRating.get('RPictureFinal')
          RStoryFinal = movieRating.get('RStoryFinal')
          RDirectorFinal = movieRating.get('RDirectorFinal')
          ROtherFinal = movieRating.get('ROtherFinal')
          RatingFinal = movieRating.get('RatingFinal')

          MovieId = movieRating.get('MovieId')
          Usercount = movieRating.get('Usercount')
          AttitudeCount = movieRating.get('AttitudeCount')
          # Fall back to 0 when hotValue or its Ranking is missing
          hotValue = value.get('value').get('hotValue') or {}
          Rank = hotValue.get('Ranking', 0)
          # u'无' ("none") marks the box-office fields that do not exist yet
          return (MovieId, movieTitle, RatingFinal,
                  ROtherFinal, RPictureFinal, RDirectorFinal,
                  RStoryFinal, Usercount, AttitudeCount, u'无',
                  u'无', Rank, 0, isRelease)
      except Exception as e:
          print('%s %s %s' % (e, page_url, value))
          return None
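The two helper methods differ only in where the rank and the box-office values come from. As an alternative sketch, not the version used in this section, a single function with defaults can cover all three formats. The name to_row is hypothetical, the code targets Python 3, and the sample dictionary is trimmed from the first response shown earlier.

```python
def to_row(payload, is_release):
    """Flatten a decoded response into the 14-field row stored later."""
    value = payload['value']
    rating = value.get('movieRating') or {}
    box = value.get('boxOffice')        # present only for movies now showing
    hot = value.get('hotValue') or {}   # present only before release
    if box:
        total = box.get('TotalBoxOffice', '') + box.get('TotalBoxOfficeUnit', '')
        today = box.get('TodayBoxOffice', '') + box.get('TodayBoxOfficeUnit', '')
        rank, show_days = box.get('Rank', 0), box.get('ShowDays', 0)
    else:
        total = today = '无'            # "none": no box office yet
        rank, show_days = hot.get('Ranking', 0), 0
    return (rating.get('MovieId'), value.get('movieTitle'),
            rating.get('RatingFinal'), rating.get('ROtherFinal'),
            rating.get('RPictureFinal'), rating.get('RDirectorFinal'),
            rating.get('RStoryFinal'), rating.get('Usercount'),
            rating.get('AttitudeCount'), total, today,
            rank, show_days, is_release)


# Trimmed from the "now showing" sample response.
sample = {'value': {
    'isRelease': True,
    'movieRating': {'MovieId': 108737, 'RatingFinal': 7.7, 'ROtherFinal': 7,
                    'RPictureFinal': 8.4, 'RDirectorFinal': 7.7,
                    'RStoryFinal': 7.3, 'Usercount': 4067,
                    'AttitudeCount': 4300},
    'movieTitle': '奇异博士',
    'boxOffice': {'Rank': 1, 'TotalBoxOffice': '5.66',
                  'TotalBoxOfficeUnit': '亿', 'TodayBoxOffice': '4776.8',
                  'TodayBoxOfficeUnit': '万', 'ShowDays': 10}}}

row = to_row(sample, 1)
print(len(row), row[9], row[10])
```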

3. Data store

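The data store inserts the returned data into an SQLite database. It covers creating the table, inserting rows and closing the database. The table has 15 columns for the movie information. The code is as follows: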

  import sqlite3


  class DataOutput(object):
      def __init__(self):
          self.cx = sqlite3.connect("MTime.db")
          self.create_table('MTime')
          self.datas = []

      def create_table(self, table_name):
          '''
          Create the table
          :param table_name: table name
          :return:
          '''
          values = '''
          id integer primary key,
          MovieId integer,
          MovieTitle varchar(40) NOT NULL,
          RatingFinal REAL NOT NULL DEFAULT 0.0,
          ROtherFinal REAL NOT NULL DEFAULT 0.0,
          RPictureFinal REAL NOT NULL DEFAULT 0.0,
          RDirectorFinal REAL NOT NULL DEFAULT 0.0,
          RStoryFinal REAL NOT NULL DEFAULT 0.0,
          Usercount integer NOT NULL DEFAULT 0,
          AttitudeCount integer NOT NULL DEFAULT 0,
          TotalBoxOffice varchar(20) NOT NULL,
          TodayBoxOffice varchar(20) NOT NULL,
          Rank integer NOT NULL DEFAULT 0,
          ShowDays integer NOT NULL DEFAULT 0,
          isRelease integer NOT NULL
          '''
          self.cx.execute('CREATE TABLE IF NOT EXISTS %s( %s ) ' % (table_name, values))

      def store_data(self, data):
          '''
          Buffer one row and write to disk once more than 10 rows are waiting
          :param data: tuple of 14 fields
          :return:
          '''
          if data is None:
              return
          self.datas.append(data)
          if len(self.datas) > 10:
              self.output_db('MTime')

      def output_db(self, table_name):
          '''
          Write the buffered rows to SQLite
          :return:
          '''
          # One "?" placeholder for each of the 14 inserted columns
          sql = ("INSERT INTO %s (MovieId,MovieTitle,"
                 "RatingFinal,ROtherFinal,RPictureFinal,"
                 "RDirectorFinal,RStoryFinal, Usercount,"
                 "AttitudeCount,TotalBoxOffice,TodayBoxOffice,"
                 "Rank,ShowDays,isRelease) "
                 "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)" % table_name)
          for data in self.datas:
              self.cx.execute(sql, data)
          # Empty the buffer after the loop; removing items while iterating
          # over the list would skip rows
          self.datas = []
          self.cx.commit()

      def output_end(self):
          '''
          Write any remaining rows and close the database
          :return:
          '''
          if len(self.datas) > 0:
              self.output_db('MTime')
          self.cx.close()
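The insert relies on sqlite3's "?" placeholders: values are passed separately from the SQL text, while the table name, which cannot be a placeholder, is formatted into the string. The short Python 3 sketch below shows the same idea with executemany on an in-memory database. The helper insert_rows and the three-column table are simplifications made for illustration; the two rows reuse values from the sample responses earlier in this section.

```python
import sqlite3


def insert_rows(conn, table_name, rows):
    """Insert (MovieId, MovieTitle, RatingFinal) rows in one batch."""
    # Only values can be bound with "?"; the table name is formatted in.
    sql = ('INSERT INTO %s (MovieId, MovieTitle, RatingFinal) '
           'VALUES (?,?,?)' % table_name)
    conn.executemany(sql, rows)
    conn.commit()


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS MTime ('
             'id integer primary key, MovieId integer, '
             'MovieTitle varchar(40) NOT NULL, '
             'RatingFinal REAL NOT NULL DEFAULT 0.0)')

insert_rows(conn, 'MTime', [(108737, '奇异博士', 7.7),
                            (229639, '我不是潘金莲', -1)])
count = conn.execute('SELECT COUNT(*) FROM MTime').fetchone()[0]
print(count)
```

Formatting the table name into the SQL is safe here only because it is a fixed string chosen by the program, never user input.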

4. Crawler scheduler

The scheduler's main job is to coordinate the modules above. It is also responsible for constructing the Ajax link. The code is as follows:

  import time


  class SpiderMan(object):
      def __init__(self):
          self.downloader = HtmlDownloader()
          self.parser = HtmlParser()
          self.output = DataOutput()

      def crawl(self, root_url):
          content = self.downloader.download(root_url)
          urls = self.parser.parser_url(root_url, content)
          # Build the link that returns the rating and box office
          for url in urls:
              try:
                  # Current time as YYYYMMDDHHMMSS plus a fixed suffix
                  t = time.strftime("%Y%m%d%H%M%S3282", time.localtime())
                  # url[0] is the movie link, url[1] is the movie id
                  rank_url = 'http://service.library.mtime.com/Movie.api' \
                      '?Ajax_CallBack=true' \
                      '&Ajax_CallBackType=Mtime.Library.Services' \
                      '&Ajax_CallBackMethod=GetMovieOverviewRating' \
                      '&Ajax_CrossDomain=1' \
                      '&Ajax_RequestUrl=%s' \
                      '&t=%s' \
                      '&Ajax_CallBackArgument0=%s' % (url[0], t, url[1])
                  rank_content = self.downloader.download(rank_url)
                  data = self.parser.parser_json(rank_url, rank_content)
                  self.output.store_data(data)
              except Exception as e:
                  print("Crawl failed: %s" % e)
          self.output.output_end()
          print("Crawl finish")


  if __name__ == '__main__':
      spider = SpiderMan()
      spider.crawl('http://theater.mtime.com/China_Beijing/')

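Once all four modules are complete, start the crawler. Because the amount of data is small, the crawl finishes in about a minute. Use the sqlite command in a shell to view the results, as shown in Figure 9-3.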
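The results can also be inspected from Python instead of the sqlite shell. The following Python 3 sketch assumes the MTime table created by the data store above; the function name top_rated and the choice of columns are made up for illustration.

```python
import sqlite3


def top_rated(conn, limit=5):
    """Return (title, rating, total box office) of the best-rated movies
    now showing, highest rating first."""
    sql = ('SELECT MovieTitle, RatingFinal, TotalBoxOffice FROM MTime '
           'WHERE isRelease = 1 ORDER BY RatingFinal DESC LIMIT ?')
    return conn.execute(sql, (limit,)).fetchall()


# Usage after a crawl has produced MTime.db:
#     conn = sqlite3.connect('MTime.db')
#     for title, rating, total in top_rated(conn):
#         print(title, rating, total)
#     conn.close()
```

Filtering on isRelease = 1 skips the unreleased movies, whose RatingFinal is stored as -1.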