
9.2 Project: Crawling the matplotlib Example Source Files

Published 2024-02-05 21:13:20

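Let's now work through a hands-on project that downloads files with FilesPipeline. matplotlib is a well-known Python plotting library, widely used in scientific computing and data analysis. The matplotlib website provides many example programs; visiting http://matplotlib.org/examples/index.html in a browser shows the example list page in Figure 9-1.

There are several hundred examples, grouped into categories. Clicking the first example opens its page, shown in Figure 9-2.

On each example page, users can read the source code or click the source code button to download the source file. To download the source files of all examples to the local machine, we can write a crawler to do the job.

9.2.1 Project Requirements

Download the source files of all examples on the http://matplotlib.org website to the local machine.

Figure 9-1

Figure 9-2

9.2.2 Page Analysis

First, let's see how to get the links to all example pages from the example list page http://matplotlib.org/examples/index.html. Use the scrapy shell command to download the page, then call the view function to inspect it in a browser, as shown in Figure 9-3.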

$ scrapy shell http://matplotlib.org/examples/index.html
...
>>> view(response)

Figure 9-3

Inspection shows that every link to an example page sits inside an <li class="toctree-l2"> under <div class="toctree-wrapper compound">, for example:

<a class="reference internal" href="animation/animate_decay.html">animate_decay</a>

Use LinkExtractor to extract the links to all example pages:

>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
>>> links = le.extract_links(response)
>>> [link.url for link in links]
['http://matplotlib.org/examples/animation/animate_decay.html',
 'http://matplotlib.org/examples/animation/basic_example.html',
 'http://matplotlib.org/examples/animation/basic_example_writer.html',
 'http://matplotlib.org/examples/animation/bayes_update.html',
 'http://matplotlib.org/examples/animation/double_pendulum_animated.html',
 'http://matplotlib.org/examples/animation/dynamic_image.html',
 'http://matplotlib.org/examples/animation/dynamic_image2.html',
 'http://matplotlib.org/examples/animation/histogram.html',
 'http://matplotlib.org/examples/animation/moviewriter.html',
 'http://matplotlib.org/examples/animation/rain.html',
 'http://matplotlib.org/examples/animation/random_data.html',
 'http://matplotlib.org/examples/animation/simple_3danim.html',
 'http://matplotlib.org/examples/animation/simple_anim.html',
 'http://matplotlib.org/examples/animation/strip_chart_demo.html',
 'http://matplotlib.org/examples/animation/subplots.html',
 'http://matplotlib.org/examples/animation/unchained.html',
 'http://matplotlib.org/examples/api/agg_oo.html',
 'http://matplotlib.org/examples/api/barchart_demo.html',
 'http://matplotlib.org/examples/api/bbox_intersect.html',
 ...
 'http://matplotlib.org/examples/user_interfaces/svg_tooltip.html',
 'http://matplotlib.org/examples/user_interfaces/toolmanager.html',
 'http://matplotlib.org/examples/user_interfaces/wxcursor_demo.html',
 'http://matplotlib.org/examples/widgets/buttons.html',
 'http://matplotlib.org/examples/widgets/check_buttons.html',
 'http://matplotlib.org/examples/widgets/cursor.html',
 'http://matplotlib.org/examples/widgets/lasso_selector_demo.html',
 'http://matplotlib.org/examples/widgets/menu.html',
 'http://matplotlib.org/examples/widgets/multicursor.html',
 'http://matplotlib.org/examples/widgets/radio_buttons.html',
 'http://matplotlib.org/examples/widgets/rectangle_selector.html',
 'http://matplotlib.org/examples/widgets/slider_demo.html',
 'http://matplotlib.org/examples/widgets/span_selector.html']
>>> len(links)
507

That completes the analysis of the example list page: 507 examples were found in total.
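To make the selection rule concrete without a running Scrapy shell, here is a minimal sketch that uses only the standard library's html.parser instead of LinkExtractor. It collects the href of every <a> nested inside an <li class="toctree-l2"> and resolves it against the page url. The HTML fragment, including the category link animation/index.html, is a made-up sample modeled on the structure described above, and TocLinkParser is a hypothetical helper name. Unlike the restrict_css rule, the sketch does not check for the enclosing div.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TocLinkParser(HTMLParser):
    """Collect links found inside <li class="toctree-l2"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self._li_stack = []   # one flag per open <li>: True if it is a toctree-l2 item
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            self._li_stack.append("toctree-l2" in classes)
        elif tag == "a" and any(self._li_stack) and attrs.get("href"):
            # Resolve the relative href against the page url
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "li" and self._li_stack:
            self._li_stack.pop()


# Hypothetical fragment modeled on the example list page
SAMPLE_HTML = """
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="animation/index.html">animation</a>
<ul>
<li class="toctree-l2"><a class="reference internal" href="animation/animate_decay.html">animate_decay</a></li>
<li class="toctree-l2"><a class="reference internal" href="animation/basic_example.html">basic_example</a></li>
</ul>
</li>
</ul>
</div>
"""

parser = TocLinkParser("http://matplotlib.org/examples/index.html")
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```

The category link in the toctree-l1 item is skipped because it is not inside a toctree-l2 item, which is the same effect the CSS restriction has in the shell session.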

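Next, analyze an example page. Call the fetch function to download the first example page, then call view to inspect it in a browser, as shown in Figure 9-4.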

>>> fetch('http://matplotlib.org/examples/animation/animate_decay.html')
...
>>> view(response)

Figure 9-4

On an example page, the download address of the example's source file can be found in an <a class="reference external"> element:

>>> href = response.css('a.reference.external::attr(href)').extract_first()
>>> href
'animate_decay.py'
>>> response.urljoin(href)
'http://matplotlib.org/examples/animation/animate_decay.py'
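The resolution step can be tried outside the Scrapy shell. The sketch below assumes that response.urljoin behaves much like the standard library's urllib.parse.urljoin applied to the page url, which is the case for a page without a <base> tag. The second href is a hypothetical root-relative link, included only to show how a different link form resolves.

```python
from urllib.parse import urljoin

page_url = "http://matplotlib.org/examples/animation/animate_decay.html"

# A relative href replaces the last path segment of the page url
relative_url = urljoin(page_url, "animate_decay.py")

# A root-relative href (hypothetical) replaces the whole path
root_relative_url = urljoin(page_url, "/mpl_examples/animation/animate_decay.py")

print(relative_url)
print(root_relative_url)
```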

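That completes the page analysis.

9.2.3 Implementation

Next, we complete the project in the following four steps:

(1) Create a Scrapy project and create the Spider with the scrapy genspider command.

(2) Enable FilesPipeline in the configuration file and specify the file download directory.

(3) Implement ExampleItem (optional).

(4) Implement ExamplesSpider.

Step 01 First create a Scrapy project named matplotlib_examples, then create the Spider with the scrapy genspider command: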

$ scrapy startproject matplotlib_examples
$ cd matplotlib_examples
$ scrapy genspider examples matplotlib.org

Step 02 In the configuration file settings.py, enable FilesPipeline and specify the file download directory:

ITEM_PIPELINES = {
  'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'examples_src'

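Step 03 Implement ExampleItem, which must define the two fields file_urls and files. Add the following code to items.py:

import scrapy


class ExampleItem(scrapy.Item):
    file_urls = scrapy.Field()  # urls of the files to download
    files = scrapy.Field()      # download results filled in by FilesPipeline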

Step 04 Implement ExamplesSpider. First set the starting point of the crawl:

import scrapy

class ExamplesSpider(scrapy.Spider):
 name = "examples"
 allowed_domains = ["matplotlib.org"]
 start_urls = ['http://matplotlib.org/examples/index.html']

 def parse(self, response):
   pass

The parse method is the parsing function for the example list page. It extracts the link to each example page, builds a Request object from it, and submits it. The details of link extraction were covered in the page analysis. The parse method is implemented as follows:

import scrapy
from scrapy.linkextractors import LinkExtractor


class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        # Extract example page links, skipping links that end in /index.html
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                           deny='/index.html$')
        links = le.extract_links(response)
        print(len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        pass

In the code above, the parse_example method is set as the parsing function for example pages; let's implement it now. An example page contains the download link for the example's source file. In parse_example, get the url of the source file, put it in a list, and assign the list to the file_urls field of ExampleItem. The parse_example method is implemented as follows:

import scrapy
from scrapy.linkextractors import LinkExtractor

from ..items import ExampleItem


class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        # Extract example page links, skipping links that end in /index.html
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                           deny='/index.html$')
        links = le.extract_links(response)
        print(len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        # Take the first external link on the page as the source file link
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = ExampleItem()
        example['file_urls'] = [url]
        return example

With the code complete, run the crawler and observe the results:

$ scrapy crawl examples -o examples.json
...
$ ls
examples.json  examples_src  matplotlib_examples  scrapy.cfg

After the run finishes, the file examples.json contains information about the downloaded files:

  $ cat examples.json
  [
  {"file_urls": ["http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py"], "files": [{"url":
"http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py", "checksum":
"502d1cd62086fb1d4de033cef2e495c0", "path":
"full/d9b551310a6668ccf43871e896f2fe6e0228567d.py"}]},
  {"file_urls": ["http://matplotlib.org/mpl_examples/axes_grid/demo_curvelinear_grid.py"], "files":
[{"url": "http://matplotlib.org/mpl_examples/axes_grid/demo_curvelinear_grid.py", "checksum":
"5cb91103f11079b40400afc0c1f4a508", "path":
"full/366386c23c5b715c49801efc7f8d55d2c74252e2.py"}]},
  {"file_urls":
["http://matplotlib.org/mpl_examples/axes_grid/make_room_for_ylabel_using_axesgrid.py"], "files":
[{"url": "http://matplotlib.org/mpl_examples/axes_grid/make_room_for_ylabel_using_axesgrid.py",
"checksum": "dcf561f97ab0905521c1957cacd2da00", "path":
"full/919cbbe6d725237e3b6051f544f6109e7189b4fe.py"}]},
  ...some content omitted...
  {"file_urls": ["http://matplotlib.org/mpl_examples/api/custom_projection_example.py"], "files":
[{"url": "http://matplotlib.org/mpl_examples/api/custom_projection_example.py", "checksum":
"bde485f9d5ceb4b4cc969ef692df5eee", "path":
"full/d56af342d7130ddd9dbf55c00664eae9a432bf70.py"}]},
  {"file_urls": ["http://matplotlib.org/examples/animation/dynamic_image2.py"], "files": [{"url":
"http://matplotlib.org/examples/animation/dynamic_image2.py", "checksum":
"98b6a6021ba841ef4a2cd36c243c516d", "path":
"full/fe635002562e8685583c1b35a8e11e8cde0a6321.py"}]},
  {"file_urls": ["http://matplotlib.org/examples/animation/basic_example.py"], "files": [{"url":
"http://matplotlib.org/examples/animation/basic_example.py", "checksum":
"1d4afc0910f6abc519e6ecd32c66896a", "path":
"full/083c113c1dac96bbc74adfc5b08cad68ec9c16db.py"}]}

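Now inspect the download directory examples_src. The 507 source files were downloaded to the examples_src/full directory, and each file name is a 40-character hexadecimal string: the sha1 hash of the url the file was downloaded from. For example, for the file url:

http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py

the sha1 hash of the url is:

d9b551310a6668ccf43871e896f2fe6e0228567d

so the file is stored at: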

# [FILES_STORE]/full/[SHA1_HASH_VALUE].py
examples_src/full/d9b551310a6668ccf43871e896f2fe6e0228567d.py
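The naming rule can be reproduced with the standard library. The following is a minimal sketch, assuming the path is built from the sha1 hex digest of the UTF-8 encoded url plus the url's file extension; default_file_path is a hypothetical helper name, not part of Scrapy.

```python
import hashlib
import os


def default_file_path(url):
    """Build a full/<sha1 of url><extension> path for a download url."""
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    media_ext = os.path.splitext(url)[1]
    return "full/%s%s" % (media_guid, media_ext)


url = "http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py"
stored_path = default_file_path(url)
print(stored_path)
```

Running it and comparing the printed path with the one recorded in examples.json is a quick way to confirm which naming rule an installed Scrapy version applies.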

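This naming scheme prevents files with the same name from overwriting one another, but the names are not descriptive: the file name says nothing about the file's content. We would like the example files to be downloaded into separate directories by category. To do this, we could write a separate script that renames the files according to the information in examples.json, or we could change the rule FilesPipeline uses to name files. Here we take the latter approach.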
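For comparison, the script-based alternative might look roughly like the sketch below. It only computes a rename plan from records shaped like those in examples.json and leaves the actual file moves to the caller. The helper names load_records and plan_renames are hypothetical, and the target layout (category/file name, taken from the last two url components) is an assumption about what the renaming should produce.

```python
import json
from urllib.parse import urlparse


def load_records(json_path):
    """Load the list of exported items from a file such as examples.json."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def plan_renames(records):
    """Map each stored path to a <category>/<file name> target path."""
    plan = {}
    for record in records:
        for info in record.get("files", []):
            # Keep the last two components of the url path
            category, filename = urlparse(info["url"]).path.rsplit("/", 2)[-2:]
            plan[info["path"]] = category + "/" + filename
    return plan


# One record copied from the examples.json output shown earlier
sample_records = [
    {"file_urls": ["http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py"],
     "files": [{"url": "http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py",
                "checksum": "502d1cd62086fb1d4de033cef2e495c0",
                "path": "full/d9b551310a6668ccf43871e896f2fe6e0228567d.py"}]},
]

plan = plan_renames(sample_records)
print(plan)
```

A caller could then move each file from FILES_STORE/<stored path> to FILES_STORE/<target path>, creating the category directories as needed.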

Reading the source code of FilesPipeline shows that its file_path method determines the file name. The relevant code is:

class FilesPipeline(MediaPipeline):
    ...
    def file_path(self, request, response=None, info=None):
        ...
        # check if called from file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url
        # detect if file_key() method has been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        ## end of deprecation warning block
        media_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        media_ext = os.path.splitext(url)[1]
        return 'full/%s%s' % (media_guid, media_ext)
    ...

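Now we implement a subclass of FilesPipeline and override file_path to apply the naming rule we want. The last two parts of each source file url are the category and the file name, for example:

http://matplotlib.org/mpl_examples/(axes_grid/demo_floating_axes.py)

The part in parentheses above can serve as the file path. Implement MyFilesPipeline in pipelines.py as follows: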

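from os.path import basename, dirname, join
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        # item is accepted and ignored: newer Scrapy versions pass it as a
        # keyword argument, older versions never supply it.
        # Use the last two url components: <category>/<file name>
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))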
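The path logic can be checked on its own before running a crawl. This is a minimal sketch of the same computation as a plain function; category_path is a hypothetical name, and posixpath is used instead of os.path so the result uses forward slashes on every platform.

```python
import posixpath
from urllib.parse import urlparse


def category_path(url):
    """Return '<category>/<file name>' from the last two parts of a url path."""
    path = urlparse(url).path
    category = posixpath.basename(posixpath.dirname(path))
    return posixpath.join(category, posixpath.basename(path))


# Both url forms that appear in examples.json
first = category_path("http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py")
second = category_path("http://matplotlib.org/examples/animation/dynamic_image2.py")
print(first)
print(second)
```

Because only the last two components are kept, two urls that share a category and file name but differ earlier in the path would map to the same target path.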

Modify the configuration file to use MyFilesPipeline instead of FilesPipeline:

ITEM_PIPELINES = {
  #'scrapy.pipelines.files.FilesPipeline': 1,
  'matplotlib_examples.pipelines.MyFilesPipeline': 1,
}

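Delete all previously downloaded files, rerun the crawler, and check the examples_src directory again. The 507 files are now downloaded into 26 directories by category, which is exactly what we wanted.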
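One way to check the per-category counts is a short script that walks the download directory. The sketch below is illustrative only: count_files_by_category is a hypothetical helper, and the demonstration runs on a throwaway directory tree with made-up file names rather than on real crawl output.

```python
import os
import tempfile
from collections import Counter


def count_files_by_category(root):
    """Return a Counter mapping each directory under root to its file count."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            counts[os.path.relpath(dirpath, root)] += len(filenames)
    return counts


# Demonstration on a temporary tree (hypothetical file names)
with tempfile.TemporaryDirectory() as root:
    for rel in ("animation/a.py", "animation/b.py", "widgets/c.py"):
        target = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "w").close()
    demo_counts = count_files_by_category(root)

print(dict(demo_counts))
```

Calling count_files_by_category('examples_src') after the crawl and inspecting len(counts) and sum(counts.values()) gives the number of category directories and the total number of files.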

That completes the file download project.
