Scrapy Unit Testing

Posted on 2024-11-17 00:02:57 · 528 words · 10 views · 0 comments

I'd like to implement some unit tests in a Scrapy (screen scraper/web crawler) project. Since a project is run through the "scrapy crawl" command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.

Update:

I've been talking on Scrapy-Users and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work though.

I can build a unit-test test class and in a test:

  • create a response object
  • try to call the parse method of my spider with the response object

However it ends up generating this traceback. Any insight as to why?

Comments (10)

守望孤独 2024-11-24 00:02:58

I'm using Scrapy 1.3.0, and the function fake_response_from_file raises an error on this line:

response = Response(url=url, request=request, body=file_content)

I get:

raise AttributeError("Response content isn't text")

The solution is to use TextResponse instead, which works fine. For example:

response = TextResponse(url=url, request=request, body=file_content)

Thanks a lot.

澜川若宁 2024-11-24 00:02:58

Similar to Hadrien's answer but for pytest: pytest-vcr.

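import requests
import pytest
from scrapy.http import HtmlResponse

# Spider stands for your own spider class; url and target are expected
# to come from fixtures or pytest.mark.parametrize.
@pytest.mark.vcr()
def test_parse(url, target):
    response = requests.get(url)
    scrapy_response = HtmlResponse(url, body=response.content)
    # parse() is usually a generator, so materialize it before comparing.
    assert list(Spider().parse(scrapy_response)) == target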

近箐 2024-11-24 00:02:58

You can follow this snippet from the scrapy site to run it from a script. Then you can make any kind of asserts you'd like on the returned items.

红玫瑰 2024-11-24 00:02:58

https://github.com/ThomasAitken/Scrapy-Testmaster

This is a package I wrote that significantly extends the functionality of the Scrapy Autounit library and takes it in a different direction (allowing for easy dynamic updating of testcases and merging the processes of debugging/testcase-generation). It also includes a modified version of the Scrapy parse command (https://docs.scrapy.org/en/latest/topics/commands.html#std-command-parse)

爱格式化 2024-11-24 00:02:57

The way I've done it is to create fake responses, so that you can test the parse function offline while still exercising it against real HTML.

A problem with this approach is that your local HTML file may not reflect the latest state online. If the HTML changes online you may have a big bug, yet your test cases will still pass. So this may not be the best way to test.

My current workflow is that whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that causes the error, and I write a unit test for it.

This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:

# scrapyproject/tests/responses/__init__.py

import os

from scrapy.http import TextResponse, Request

def fake_response_from_file(file_name, url=None):
    """
    Create a fake Scrapy HTTP response from an HTML file.
    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not os.path.isabs(file_name):
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    # Read raw bytes; the response decodes them with the encoding given below.
    with open(file_path, 'rb') as f:
        file_content = f.read()

    # TextResponse rather than the base Response: as another answer in this
    # thread reports, the base class fails with "Response content isn't text".
    response = TextResponse(url=url,
        request=request,
        body=file_content,
        encoding='utf-8')
    return response

The sample html file is located in scrapyproject/tests/responses/osdir/sample.html

Then the testcase could look as follows:
The test case location is scrapyproject/tests/test_osdir.py

import unittest
from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file

class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
            count += 1  # count the items so the length check below is meaningful
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)

That's basically how I test my parsing methods, but it's not only for parsing methods. If it gets more complex, I suggest looking at Mox.
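For the more complex cases, a mocking library lets a test replace a collaborator of the parsing code. The sketch below uses the standard library's unittest.mock rather than Mox, simply because it needs no extra install; the PriceParser class and its fetch_rate dependency are hypothetical names, not part of Scrapy or of the project above.

```python
import unittest
from unittest import mock


class PriceParser:
    """Hypothetical helper that a spider callback might delegate to."""

    def fetch_rate(self, currency):
        # Real code might call a network service here; unit tests should not.
        raise RuntimeError("network access is not allowed in unit tests")

    def to_usd(self, amount, currency):
        return round(amount * self.fetch_rate(currency), 2)


class PriceParserTest(unittest.TestCase):

    def test_to_usd_uses_rate(self):
        parser = PriceParser()
        # Replace the collaborator for the duration of the with-block only.
        with mock.patch.object(parser, "fetch_rate", return_value=1.5) as fake:
            self.assertEqual(parser.to_usd(10, "EUR"), 15.0)
        fake.assert_called_once_with("EUR")
```

The test can live next to the spider tests and be run by the same unittest runner.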

疯狂的代价 2024-11-24 00:02:57

I use Betamax to run tests against the real site the first time and keep the HTTP responses locally, so that subsequent test runs are very fast:

Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

When you need the latest version of the site, just remove what Betamax has recorded and re-run the tests.

Example:

from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase


with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        # next(result) works on Python 2.6+ and Python 3; result.next() is Python 2 only.
        self.assertEqual({'image_href': u'image1.html'}, next(result))
        self.assertEqual({'image_href': u'image2.html'}, next(result))
        self.assertEqual({'image_href': u'image3.html'}, next(result))
        self.assertEqual({'image_href': u'image4.html'}, next(result))
        self.assertEqual({'image_href': u'image5.html'}, next(result))

        with self.assertRaises(StopIteration):
            next(result)

FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.

淡淡の花香 2024-11-24 00:02:57

The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.

单调的奢华 2024-11-24 00:02:57

This is a very late answer, but I've been annoyed with Scrapy testing, so I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.

It works by defining test specifications rather than static output.
For example if we are crawling this sort of item:

{
    "name": "Alex",
    "age": 21,
    "gender": "Female",
}

We can define a scrapy-test ItemSpec:

# Type is assumed to be importable from scrapytest.tests like the other checks
from scrapytest.tests import Match, MoreThan, LessThan, Type
from scrapytest.spec import ItemSpec

class MySpec(ItemSpec):
    name_test = Match('.{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')

The same idea applies to Scrapy stats, via StatsSpec:

from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan

class MyStatsSpec(StatsSpec):
    validate = {
        "item_scraped_count": MoreThan(0),
    }

Afterwards it can be run against live or cached results:

$ scrapy-test 
# or
$ scrapy-test --cache

I've been running cached runs for development changes and daily cronjobs for detecting website changes.
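The idea of checking items against rules rather than against a fixed expected output can be sketched in plain Python. This is not the scrapy-test API; the rule helpers and the check_item function below are hypothetical names used only to illustrate the approach.

```python
import re


def match(pattern):
    """Rule: the value must be a string that fully matches the regex."""
    return lambda value: isinstance(value, str) and re.fullmatch(pattern, value) is not None


def more_than(limit):
    return lambda value: value > limit


def less_than(limit):
    return lambda value: value < limit


def is_type(expected):
    return lambda value: isinstance(value, expected)


# One list of rules per field; every rule must pass, checked in order.
ITEM_SPEC = {
    "name": [match(r".{3,}")],
    "age": [is_type(int), more_than(18), less_than(99)],
    "gender": [match(r"Female|Male")],
}


def check_item(item, spec):
    """Return the names of fields that are missing or break a rule."""
    failed = []
    for field, rules in spec.items():
        if field not in item or not all(rule(item[field]) for rule in rules):
            failed.append(field)
    return failed


good = check_item({"name": "Alex", "age": 21, "gender": "Female"}, ITEM_SPEC)
bad = check_item({"name": "Al", "age": 17, "gender": "Female"}, ITEM_SPEC)
print(good, bad)
```

Because the rules describe ranges and shapes instead of exact values, the same spec keeps working when the site's content changes but its structure does not.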

忆悲凉 2024-11-24 00:02:57

I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.

Stealing some ideas from the check and parse Scrapy commands I ended up with the following base TestCase class to run assertions against live sites:

from twisted.internet import defer  # needed for @defer.inlineCallbacks in the examples
from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output

class SpiderTestCase(unittest.TestCase):
    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise ex

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider

Example:

@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)

or perform one request in the setup and run multiple tests against the results:

@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)

时光磨忆 2024-11-24 00:02:57

Slightly simpler, by removing the def fake_response_from_file from the chosen answer:

import unittest
from spiders.my_spider import MySpider
from scrapy.selector import Selector


class TestParsers(unittest.TestCase):


    def setUp(self):
        self.spider = MySpider(limit=1)
        self.html = Selector(text=open("some.htm", 'r').read())


    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)


if __name__ == '__main__':
    unittest.main()