11.3 Project in Practice: Scraping Quotes from toscrape

Published 2024-02-05 21:13:20

11.3.1 Project Requirements

Scrape the quotes published on the site http://quotes.toscrape.com/js.

11.3.2 Page Analysis

The pages of this site were analyzed at the beginning of this chapter; refer back to that discussion for details.

11.3.3 Implementation

First, in the splash_examples project directory, create the Spider with the scrapy genspider command:

scrapy genspider quotes quotes.toscrape.com

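In this case, rendering each page through Splash's render.html endpoint and then scraping the result is all that is needed. The QuotesSpider implementation is as follows: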

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        # Send the start URLs through Splash so the page's JavaScript is rendered.
        for url in self.start_urls:
            yield SplashRequest(url, args={'images': 0, 'timeout': 3})

    def parse(self, response):
        # Each quote is contained in a div.quote element of the rendered page.
        for sel in response.css('div.quote'):
            quote = sel.css('span.text::text').extract_first()
            author = sel.css('small.author::text').extract_first()
            yield {'quote': quote, 'author': author}

        # Follow the "next" link, if there is one, again through Splash.
        href = response.css('li.next > a::attr(href)').extract_first()
        if href:
            url = response.urljoin(href)
            yield SplashRequest(url, args={'images': 0, 'timeout': 3})

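The code above submits requests with SplashRequest. The endpoint parameter does not need to be passed to the SplashRequest constructor, because its default value is already 'render.html'. The args parameter tells Splash not to load images ('images': 0) and sets the render timeout to 3 seconds ('timeout': 3).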
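To make the meaning of args more concrete, the sketch below builds the kind of URL that such a request corresponds to when Splash's render.html endpoint is called directly over HTTP. It assumes a Splash instance listening at http://localhost:8050; that address is an assumption and is not stated in this section. build_render_url is a hypothetical helper name. scrapy-splash handles the communication with Splash itself, so this only illustrates what the parameters mean.

```python
from urllib.parse import urlencode

# Assumed address of a locally running Splash instance (not given in this section).
SPLASH_URL = 'http://localhost:8050'


def build_render_url(page_url, endpoint='render.html', **args):
    """Hypothetical helper: build a GET URL for a Splash rendering endpoint."""
    params = {'url': page_url}   # the page Splash should load and render
    params.update(args)          # rendering options such as images and timeout
    return '{}/{}?{}'.format(SPLASH_URL, endpoint, urlencode(params))


# Same options as in QuotesSpider: no images, 3-second render timeout.
render_url = build_render_url('http://quotes.toscrape.com/js/', images=0, timeout=3)
print(render_url)
```

Opening the printed URL in a browser, with Splash running at the assumed address, should return the rendered HTML of the page, which is what the spider's parse method receives as the response.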

Run the spider and inspect the results:

$ scrapy crawl quotes -o quotes.csv
...
$ cat -n quotes.csv
    1  quote,author
    2  “The world as we have created it is a process of our thinking. It cannot be changed without
changing our thinking.”,Albert Einstein
    3  "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",J.K.
Rowling
    4  “There are only two ways to live your life. One is as though nothing is a miracle. The other is
as though everything is a miracle.”,Albert Einstein
    5  "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be
intolerably stupid.”",Jane Austen
    6  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than
absolutely boring.”",Marilyn Monroe
    7  “Try not to become a man of success. Rather become a man of value.”,Albert Einstein
    8  “It is better to be hated for what you are than to be loved for what you are not.”,André Gide
    9  "“I have not failed. I've just found 10,000 ways that won't work.”",Thomas A. Edison
    10  “A woman is like a tea bag; you never know how strong it is until it's in hot water.”,Eleanor
Roosevelt
  ...
    91  "“I believe in Christianity as I believe that the sun has risen: not only because I see it, but
because by it I see everything else.”",C.S. Lewis
    92  "“The truth."" Dumbledore sighed. ""It is a beautiful and terrible thing, and should therefore
be treated with great caution.”",J.K. Rowling
    93  "“I'm the one that's got to die when it's time for me to die, so let me live my life the way I
want to.”",Jimi Hendrix
    94  “To die will be an awfully big adventure.”,J.M. Barrie
    95  “It takes courage to grow up and become who you really are.”,E.E. Cummings
    96  “But better to get hurt by the truth than comforted with a lie.”,Khaled Hosseini
    97  “You never really understand a person until you consider things from his point of view...
Until you climb inside of his skin and walk around in it.”,Harper Lee
    98  "“You have to write the book that wants to be written. And if the book will be too difficult for
grown-ups, then you write it for children.”",Madeleine L'Engle
    99  “Never tell the truth to people who are not worthy of it.”,Mark Twain
   100  "“A person's a person, no matter how small.”",Dr. Seuss
   101  "“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”",George R.R.
Martin

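The file has 101 lines, a header row plus 100 quotes, which shows that we successfully scraped all 100 quotes from the site's 10 pages.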
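Counting records with Python's csv module is a little more robust than counting lines, because quotes that contain commas are written as quoted fields. The sketch below is a minimal illustration that uses two rows copied from the output above as an in-memory sample; count_quotes is a hypothetical helper name.

```python
import csv
import io

# Two records copied from the output above, preceded by the header row.
SAMPLE = (
    'quote,author\n'
    '“Try not to become a man of success. Rather become a man of value.”,Albert Einstein\n'
    '"“I have not failed. I\'ve just found 10,000 ways that won\'t work.”",Thomas A. Edison\n'
)


def count_quotes(csv_file):
    """Hypothetical helper: return the number of records and the set of authors."""
    rows = list(csv.DictReader(csv_file))
    return len(rows), {row['author'] for row in rows}


count, authors = count_quotes(io.StringIO(SAMPLE))
print(count, sorted(authors))
```

For the real file, the same function could be given open('quotes.csv', newline='', encoding='utf-8') instead of the in-memory sample; based on the result above, the expected count would be 100.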
