9.5 Dynamic Crawler 2: Scraping Qunar

Published 2024-01-26 22:39:51

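With Selenium covered, we now write a simple dynamic crawler that scrapes hotel listings from Qunar (去哪儿网). The goal is to fetch today's hotel listings for Shanghai and save them to a text file. The task breaks down into the following functions:

1) Search: enter the destination and the check-in and check-out dates in the search form, then click the search button.

2) Fetch one complete page of data. Qunar loads each results page in two batches: the first load brings 15 records, after which the page has to be scrolled to the bottom to trigger the second load.

3) Once the complete, rendered HTML document for a page is available, use BeautifulSoup to extract the hotel information and store it.

4) When parsing is finished, click the next-page link and continue extracting data.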

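Step 1: find the hotel search page, shown in Figure 9-15.

Inspecting the HTML with Firebug shows how to locate the destination box, the check-in date, the check-out date, and the search button. Selenium can then find these elements, fill in the values, and click the search button.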

  # Locate the destination box, the two date fields, and the search button
  ele_toCity = driver.find_element_by_name('toCity')
  ele_fromDate = driver.find_element_by_id('fromDate')
  ele_toDate = driver.find_element_by_id('toDate')
  ele_search = driver.find_element_by_class_name('search-btn')
  # Enter the destination
  ele_toCity.clear()
  ele_toCity.send_keys(to_city)
  ele_toCity.click()
  # Enter the check-in and check-out dates
  ele_fromDate.clear()
  ele_fromDate.send_keys(fromdate)
  ele_toDate.clear()
  ele_toDate.send_keys(todate)
  # Submit the search
  ele_search.click()

Figure 9-15  Search page
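
The fromdate and todate values typed into the form are plain date strings; the complete listing in this section formats them as YYYY-MM-DD. A minimal sketch of building such a pair, written for Python 3 (unlike the book's Python 2 listings). The helper name stay_dates and its nights parameter are assumptions made for illustration:

```python
import datetime


def stay_dates(check_in=None, nights=1):
    """Return (check_in, check_out) as 'YYYY-MM-DD' strings.

    check_in defaults to today; nights is the length of the stay.
    """
    if check_in is None:
        check_in = datetime.date.today()
    check_out = check_in + datetime.timedelta(days=nights)
    fmt = '%Y-%m-%d'
    return check_in.strftime(fmt), check_out.strftime(fmt)


# A fixed date keeps the example reproducible.
fromdate, todate = stay_dates(datetime.date(2024, 1, 31))
print(fromdate, todate)
```

Passing an explicit date, rather than always calling date.today() inside the crawler, makes it easier to rerun a crawl for a specific day.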

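Step 2: collect a complete page of data in two passes. For the second pass, have the driver execute a JavaScript snippet that scrolls the page to the bottom.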

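  # This fragment runs inside the paging loop of the complete listing, hence the break
  try:
     # Wait up to 10 s for the results page, whose title contains the city name
     WebDriverWait(driver, 10).until(
       EC.title_contains(unicode(to_city))
     )
  except Exception as e:
     print(e)
     break
  time.sleep(5)  # give the first batch of results time to render

  # Scroll to the bottom to trigger the second batch
  js = "window.scrollTo(0, document.body.scrollHeight);"
  driver.execute_script(js)
  time.sleep(5)
  htm_const = driver.page_source  # fully rendered HTML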
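
The fixed five-second sleeps above are simple but either waste time or cut the load short. One possible alternative, sketched here under the assumption that the page grows taller when the second batch arrives, is to keep scrolling until document.body.scrollHeight stops changing. The helper name scroll_until_stable and its parameters are hypothetical; only driver.execute_script is taken from Selenium. The sketch is written for Python 3:

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom until the document height stops growing.

    driver is any object with a Selenium-style execute_script(js) method.
    Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight;"
    scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script(scroll_down)
        time.sleep(pause)  # let lazily loaded content arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds
```

The max_rounds cap guards against pages that keep loading content indefinitely.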

Step 3: parse the hotel information with BeautifulSoup, then clean and store the data.

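  # page_source is already Unicode, so no from_encoding argument is needed
  soup = BeautifulSoup(htm_const, 'html.parser')
  infos = soup.find_all(class_="item_hotel_info")
  # Append to a UTF-8 file named after the city and the check-in date
  f = codecs.open(unicode(to_city)+unicode(fromdate)+u'.html', 'a', 'utf-8')
  for info in infos:
     # Separator before each hotel: page number followed by a dashed rule
     f.write(str(page_num)+'--'*50)
     # Drop spaces and tabs, then keep only the non-empty lines
     content = info.get_text().replace(" ","").replace("\t","").strip()
     for line in [ln for ln in content.splitlines() if ln.strip()]:
       f.write(line)
       f.write('\r\n')
  f.close()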
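
To see what the cleaning step does to a block of extracted text, the same replace/strip/splitlines logic can be isolated in a small function and run on a plain string. This is an illustrative Python 3 sketch; the function name clean_hotel_text and the sample text are made up and do not come from Qunar:

```python
def clean_hotel_text(raw):
    """Remove spaces and tabs, then return the non-empty lines."""
    text = raw.replace(" ", "").replace("\t", "").strip()
    return [ln for ln in text.splitlines() if ln.strip()]


# Made-up sample resembling the whitespace-heavy output of get_text().
sample = "  Grand Hotel \n\n\t 4.5 / 5 \n   \n from 399 \n"
print(clean_hotel_text(sample))
```

Because every space is removed, words inside a line are merged as well ("Grand Hotel" becomes "GrandHotel"). That is harmless for Chinese text but worth keeping in mind for names written with spaces.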

Step 4: click the next-page link and repeat the process.

  # Wait up to 10 s for the next-page link to become visible, then click it
  next_page = WebDriverWait(driver, 10).until(
     EC.visibility_of(driver.find_element_by_css_selector(".item.next"))
  )
  next_page.click()
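
The repeat-until-there-is-no-next-page logic can be sketched independently of Selenium by passing the two browser actions in as callables. The names crawl_pages, scrape_page, go_to_next, and max_pages are hypothetical; in a real crawler scrape_page would wrap steps 2 and 3, and go_to_next would wrap the click above. Written for Python 3:

```python
def crawl_pages(scrape_page, go_to_next, max_pages=50):
    """Scrape pages until there is no next page or max_pages is reached.

    scrape_page(page_num) processes the page currently shown.
    go_to_next() returns True after moving to the next page, and returns
    False or raises when no next page is available.
    Returns the number of pages scraped.
    """
    scraped = 0
    while scraped < max_pages:
        scrape_page(scraped)
        scraped += 1
        try:
            moved = go_to_next()
        except Exception as exc:  # e.g. the next-page link was not found
            print(exc)
            break
        if not moved:
            break
    return scraped
```

Separating the loop from the browser calls makes the stopping conditions easy to test with stand-in functions, and the max_pages cap keeps a crawl from running without bound.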

This small example implements only the basic functionality. The complete code is as follows:

  # -*- coding: utf-8 -*-
  # Python 2 listing using the Selenium 2/3 locator API (find_element_by_*)
  import codecs
  import datetime
  import time

  from bs4 import BeautifulSoup
  from selenium import webdriver
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.support.ui import WebDriverWait


  class QunaSpider(object):

      def get_hotel(self, driver, to_city, fromdate, todate):
          # Fill in the search form and submit it
          ele_toCity = driver.find_element_by_name('toCity')
          ele_fromDate = driver.find_element_by_id('fromDate')
          ele_toDate = driver.find_element_by_id('toDate')
          ele_search = driver.find_element_by_class_name('search-btn')
          ele_toCity.clear()
          ele_toCity.send_keys(to_city)
          ele_toCity.click()
          ele_fromDate.clear()
          ele_fromDate.send_keys(fromdate)
          ele_toDate.clear()
          ele_toDate.send_keys(todate)
          ele_search.click()
          page_num = 0
          while True:
              try:
                  # Wait for the results page, whose title contains the city name
                  WebDriverWait(driver, 10).until(
                      EC.title_contains(unicode(to_city))
                  )
              except Exception as e:
                  print(e)
                  break
              time.sleep(5)

              # Scroll to the bottom to trigger the second batch of results
              js = "window.scrollTo(0, document.body.scrollHeight);"
              driver.execute_script(js)
              time.sleep(5)

              # Parse the rendered page and append the hotel text to the output file
              htm_const = driver.page_source
              soup = BeautifulSoup(htm_const, 'html.parser')
              infos = soup.find_all(class_="item_hotel_info")
              f = codecs.open(unicode(to_city) + unicode(fromdate) + u'.html',
                              'a', 'utf-8')
              for info in infos:
                  f.write(str(page_num) + '--' * 50)
                  content = info.get_text().replace(" ", "").replace("\t", "").strip()
                  for line in [ln for ln in content.splitlines() if ln.strip()]:
                      f.write(line)
                      f.write('\r\n')
              f.close()
              try:
                  # Move to the next page; stop when there is no next-page link
                  next_page = WebDriverWait(driver, 10).until(
                      EC.visibility_of(
                          driver.find_element_by_css_selector(".item.next"))
                  )
                  next_page.click()
                  page_num += 1
                  time.sleep(10)
              except Exception as e:
                  print(e)
                  break

      def crawl(self, root_url, to_city):
          # Check-in is today and check-out is tomorrow
          today = datetime.date.today().strftime('%Y-%m-%d')
          tomorrow = datetime.date.today() + datetime.timedelta(days=1)
          tomorrow = tomorrow.strftime('%Y-%m-%d')
          driver = webdriver.Firefox(
              executable_path=r'D:\geckodriver_win32\geckodriver.exe')
          driver.set_page_load_timeout(50)
          driver.get(root_url)
          driver.maximize_window()  # maximize the browser window
          driver.implicitly_wait(10)  # implicit wait: up to 10 s when locating elements
          self.get_hotel(driver, to_city, today, tomorrow)


  if __name__ == '__main__':
      spider = QunaSpider()
      spider.crawl('http://hotel.qunar.com/', u"上海")
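
The storage part of the listing opens and closes the output file by hand on every page. One possible tidier variant, sketched for Python 3, uses a with block so the file is closed even if a write fails. The function name append_page and its parameters are assumptions for illustration, and unlike the listing this sketch puts the separator on its own line:

```python
import io


def append_page(path, page_num, hotels):
    """Append one page of hotel text to a UTF-8 file.

    hotels is a list of hotels, each given as a list of text lines.
    """
    separator = str(page_num) + '--' * 50
    # newline='' writes '\r\n' exactly as given on every platform
    with io.open(path, 'a', encoding='utf-8', newline='') as f:
        for lines in hotels:
            f.write(separator + '\r\n')
            for line in lines:
                f.write(line + '\r\n')
```

A call such as append_page('out.txt', 0, [['HotelA', '399']]) appends one separator line followed by the two text lines.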
