Scrapy Unit Testing

Posted 2024-11-17 00:02:57

I'd like to implement some unit tests in a Scrapy project (screen scraper/web crawler). Since the project is run through the "scrapy crawl" command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.

Update:

I've been talking on Scrapy-Users and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work though.

I can build a unit test class and, in a test:

  • create a response object
  • try to call the parse method of my spider with the response object

However it ends up generating this traceback. Any insight as to why?
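
For reference, a minimal sketch of that approach (MySpider, the HTML body and the assertion are placeholders, not the actual project code):

import unittest

from scrapy.http import HtmlResponse, Request

from myproject.spiders.my_spider import MySpider  # hypothetical spider module


class MySpiderTest(unittest.TestCase):
    def test_parse(self):
        url = 'http://www.example.com/page'
        request = Request(url=url)
        # Build a fake response with a hand-written HTML body.
        response = HtmlResponse(
            url=url,
            request=request,
            body=b"<html><body><a href='image1.html'>img</a></body></html>",
        )
        # Call the spider's parse callback directly and collect its output.
        results = list(MySpider().parse(response))
        # Assert that the expected items/requests come back.
        self.assertEqual(len(results), 1)


if __name__ == '__main__':
    unittest.main()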

Answers (10)

守望孤独 2024-11-24 00:02:58

I'm using Scrapy 1.3.0, and the fake_response_from_file function raises an error at this line:

response = Response(url=url, request=request, body=file_content)

I get:

raise AttributeError("Response content isn't text")

The solution is to use TextResponse instead, and it works fine, for example:

response = TextResponse(url=url, request=request, body=file_content)     

Thanks a lot.
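
For reference, a minimal sketch of the adjusted helper, assuming the same file-reading logic as the fake_response_from_file helper from the chosen answer below:

import os

from scrapy.http import Request, TextResponse


def fake_response_from_file(file_name, url='http://www.example.com'):
    """Create a fake Scrapy TextResponse from a local HTML file."""
    responses_dir = os.path.dirname(os.path.realpath(__file__))
    file_path = os.path.join(responses_dir, file_name)
    with open(file_path, 'rb') as f:
        file_content = f.read()
    request = Request(url=url)
    # TextResponse (unlike the base Response) carries an encoding,
    # so .text, .xpath() and .css() work on the body.
    return TextResponse(url=url, request=request, body=file_content,
                        encoding='utf-8')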

澜川若宁 2024-11-24 00:02:58

Similar to Hadrien's answer but for pytest: pytest-vcr.

import pytest
import requests
from scrapy.http import HtmlResponse

from myproject.spiders import Spider  # your own spider class


@pytest.mark.vcr()
def test_parse(url, target):
    # The HTTP response is recorded into a VCR cassette on the first run.
    response = requests.get(url)
    scrapy_response = HtmlResponse(url, body=response.content)
    assert list(Spider().parse(scrapy_response)) == target
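
The url and target arguments are pytest fixtures you have to provide yourself; a hypothetical conftest.py might look like this (the URL and expected output are made up):

# conftest.py
import pytest


@pytest.fixture
def url():
    # Hypothetical page that the spider's parse() callback understands.
    return 'http://www.example.com/listing'


@pytest.fixture
def target():
    # Hypothetical expected output of parse() for that page.
    return [{'title': 'Example item'}]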

近箐 2024-11-24 00:02:58

You can follow this snippet from the Scrapy site to run Scrapy from a script. Then you can make whatever assertions you'd like on the returned items.
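
A rough sketch of that approach, collecting the scraped items via the item_scraped signal (MySpider and the final assertion are placeholders; note the Twisted reactor can only be started once per process, so this suits a single end-to-end test):

import unittest

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import MySpider  # hypothetical spider


class CrawlFromScriptTest(unittest.TestCase):
    def test_crawl(self):
        items = []

        def collect_item(item, response, spider):
            items.append(item)

        process = CrawlerProcess(settings={'LOG_ENABLED': False})
        crawler = process.create_crawler(MySpider)
        crawler.signals.connect(collect_item, signal=signals.item_scraped)
        process.crawl(crawler)
        process.start()  # blocks until the crawl finishes

        # Assert whatever you expect about the returned items.
        self.assertGreater(len(items), 0)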

红玫瑰 2024-11-24 00:02:58

https://github.com/ThomasAitken/Scrapy-Testmaster

This is a package I wrote that significantly extends the functionality of the Scrapy Autounit library and takes it in a different direction (allowing for easy dynamic updating of testcases and merging the processes of debugging/testcase-generation). It also includes a modified version of the Scrapy parse command (https://docs.scrapy.org/en/latest/topics/commands.html#std-command-parse)

爱格式化 2024-11-24 00:02:57

The way I've done it is to create fake responses; this way you can test the parse function offline. But you still get a realistic situation by using real HTML.

A problem with this approach is that your local HTML file may not reflect the latest state online. So if the HTML changes online, you may have a big bug while your test cases still pass. For that reason it may not be the best way to test.

My current workflow is: whenever there is an error, an email is sent to the admin with the URL. Then, for that specific error, I create an HTML file with the content that caused it, and write a unit test for it.

This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:

# scrapyproject/tests/responses/__init__.py

import os

from scrapy.http import Response, Request

def fake_response_from_file(file_name, url=None):
    """
    Create a Scrapy fake HTTP response from an HTML file.
    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name[0] == '/':
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    with open(file_path, 'r') as f:
        file_content = f.read()

    # NOTE: on newer Scrapy versions the body must go into a TextResponse
    # instead of a plain Response (see the first reply in this thread).
    response = Response(url=url,
        request=request,
        body=file_content)
    response.encoding = 'utf-8'
    return response

The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.

Then the test case could look as follows; it lives at scrapyproject/tests/test_osdir.py:

import unittest
from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file

class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            count += 1
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)

That's basically how I test my parsing methods, but it's not only for parsing methods. If it gets more complex I suggest looking at Mox.
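
To actually run these, point nose or pytest (or python -m unittest discover) at the scrapyproject/tests directory; assuming that directory is a package and the project root is on the Python path, for example:

python -m unittest discover scrapyproject/tests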

疯狂的代价 2024-11-24 00:02:57

I use Betamax to run the test against the real site the first time and keep the HTTP responses locally, so that subsequent test runs are super fast:

Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

When you need to get the latest version of the site, just remove what Betamax has recorded and re-run the test.

Example:

from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase


with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        self.assertEqual({'image_href': u'image1.html'}, next(result))
        self.assertEqual({'image_href': u'image2.html'}, next(result))
        self.assertEqual({'image_href': u'image3.html'}, next(result))
        self.assertEqual({'image_href': u'image4.html'}, next(result))
        self.assertEqual({'image_href': u'image5.html'}, next(result))

        with self.assertRaises(StopIteration):
            next(result)

FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.

淡淡の花香 2024-11-24 00:02:57

The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
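
For reference, contracts are declared in the callback's docstring and run with the scrapy check command; a rough sketch (the URL, the expected counts and the field names are made up):

import scrapy


class SampleSpider(scrapy.Spider):
    name = 'sample'

    def parse(self, response):
        """Parse a category page.

        @url http://www.example.com/category/some-page
        @returns items 1 16
        @returns requests 0 0
        @scrapes title price
        """
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').extract_first(),
                'price': product.css('.price::text').extract_first(),
            }

Running "scrapy check sample" then fetches the @url page, calls parse on it, and verifies the @returns and @scrapes constraints.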

单调的奢华 2024-11-24 00:02:57

This is a very late answer, but I've been annoyed with Scrapy testing, so I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.

It works by defining test specifications rather than static output.
For example, if we are crawling this sort of item:

{
    "name": "Alex",
    "age": 21,
    "gender": "Female",
}

We can define a scrapy-test ItemSpec:

from scrapytest.tests import Match, MoreThan, LessThan, Type
from scrapytest.spec import ItemSpec

class MySpec(ItemSpec):
    name_test = Match('{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')

There are also tests in the same spirit for Scrapy stats, as StatsSpec:

from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan

class MyStatsSpec(StatsSpec):
    validate = {
        "item_scraped_count": MoreThan(0),
    }

Afterwards it can be run against live or cached results:

$ scrapy-test 
# or
$ scrapy-test --cache

I've been running cached runs for development changes and daily cronjobs for detecting website changes.

忆悲凉 2024-11-24 00:02:57

I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.

Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:

from twisted.trial import unittest
from twisted.internet import defer  # used by the inlineCallbacks examples below

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output

class SpiderTestCase(unittest.TestCase):
    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ",     ex)
                    raise ex

                # Returning any requests here would make the     crawler follow them.
                return None

        return TestSpider

Example:

@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)

or perform one request in the setup and run multiple tests against the results:

@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)

时光磨忆 2024-11-24 00:02:57

Slightly simpler, by removing the fake_response_from_file helper from the chosen answer:

import unittest
from spiders.my_spider import MySpider
from scrapy.selector import Selector


class TestParsers(unittest.TestCase):


    def setUp(self):
        self.spider = MySpider(limit=1)
        with open("some.htm", 'r') as f:
            self.html = Selector(text=f.read())


    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)


if __name__ == '__main__':
    unittest.main()