I'd like to implement some unit tests in a Scrapy project (screen scraper/web crawler). Since the project is run through the "scrapy crawl" command, I can run it through something like nose. Since scrapy is built on top of twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.
Update:
I've been talking on Scrapy-Users and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work though.
I can build a unit-test class and, in a test:
- create a response object
- try to call the parse method of my spider with the response object
However, it ends up generating this traceback. Any insight as to why?
I'm using scrapy 1.3.0, and the function fake_response_from_file raises an error:
I get:
The solution is to use TextResponse instead, and it works fine, for example:
Thanks a lot.
Similar to Hadrien's answer but for pytest: pytest-vcr.
You can follow this snippet from the scrapy site to run it from a script. Then you can make any kind of assertions you'd like on the returned items.
https://github.com/ThomasAitken/Scrapy-Testmaster
This is a package I wrote that significantly extends the functionality of the Scrapy Autounit library and takes it in a different direction (allowing for easy dynamic updating of testcases and merging the processes of debugging/testcase-generation). It also includes a modified version of the Scrapy parse command (https://docs.scrapy.org/en/latest/topics/commands.html#std-command-parse).
The way I've done it is to create fake responses; this way you can test the parse function offline, while still getting realistic cases by using real HTML.
A problem with this approach is that your local HTML file may not reflect the latest state online. So if the HTML changes online you may have a big bug, but your test cases will still pass. So this may not be the best way to test.
My current workflow is: whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that is causing the error, and I create a unittest for it.
This is the code I use to create sample Scrapy http responses for testing from a local html file:
The sample html file is located in scrapyproject/tests/responses/osdir/sample.html
Then the testcase could look as follows:
The test case location is scrapyproject/tests/test_osdir.py
That's basically how I test my parsing methods, but it's not only for parsing methods. If it gets more complex, I suggest looking at Mox.
I use Betamax to run tests against the real site the first time and keep the http responses locally, so that subsequent test runs are super fast:
When you need the latest version of the site, just remove what betamax has recorded and re-run the tests.
Example:
FYI, I discovered betamax at PyCon 2015 thanks to Ian Cordasco's talk.
The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
This is a very late answer, but I've been annoyed with scrapy testing, so I wrote scrapy-test, a framework for testing scrapy crawlers against defined specifications.
It works by defining test specifications rather than static output.
For example, if we are crawling this sort of item:
We can define a scrapy-test ItemSpec:
There's also the same idea of tests for scrapy stats, as StatsSpec:
Afterwards it can be run against live or cached results:
I've been running cached runs for development changes and daily cronjobs for detecting website changes.
I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.
Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:
Example:
or perform one request in the setup and run multiple tests against the results:
Slightly simpler, by removing the def fake_response_from_file from the chosen answer: