Unit testing a screen scraper
I'm in the process of writing an HTML screen scraper. What would be the best way to create unit tests for this?
Is it "ok" to have a static html file and read it from disk on every test?
Do you have any suggestions?
To guarantee that the test can be run over and over again, you should have a static page to test against (i.e. reading it from disk is OK).
If you write a test that touches the live page on the web, that's probably not a unit test but an integration test. You could have those too.
For my Ruby+Mechanize scrapers I've been experimenting with integration tests that transparently test against as many versions of the target page as possible.
Inside the tests I'm overloading the scraper HTTP fetch method to automatically re-cache a newer version of the page, in addition to an "original" copy saved manually. Then each integration test runs against:
... and raises an exception if the number of fields returned by them is different, e.g. they've changed the name of a thumbnail class, but still provides some resilience against tests breaking because the target site is down.
Files are OK, but your screen scraper processes text. You should have various unit tests that "scrape" different pieces of text hard-coded within each unit test. Each piece should "provoke" a different part of your scraper method.
This way you completely remove dependencies on anything external, both files and web pages. Your tests will also be easier to maintain individually, since they no longer depend on external files, and they will execute (slightly) faster ;)
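A minimal sketch of that idea in Python, assuming a hypothetical link scraper (LinkScraper and scrape_links are illustrative names, not anything from the question). Each test feeds a small hard-coded snippet that provokes one behaviour:

```python
import unittest
from html.parser import HTMLParser


class LinkScraper(HTMLParser):
    """Hypothetical scraper: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def scrape_links(html):
    """Return all link targets found in an HTML string."""
    scraper = LinkScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.links


class ScrapeLinksTest(unittest.TestCase):
    # Each test hard-codes the smallest snippet that exercises one code path.
    def test_single_link(self):
        self.assertEqual(scrape_links('<p><a href="/a">A</a></p>'), ["/a"])

    def test_anchor_without_href_is_skipped(self):
        self.assertEqual(scrape_links('<a name="top">Top</a>'), [])

    def test_unclosed_tags_still_parse(self):
        html = '<ul><li><a href="/x">X<li><a href="/y">Y</ul>'
        self.assertEqual(scrape_links(html), ["/x", "/y"])
```

Saved as a module, this can be run with python -m unittest. No file or network access is involved, so a failure points directly at the scraper logic.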
To create your unit tests, you need to know how your scraper works and what sorts of information you think it should be extracting. Using simple web pages as unit tests could be OK depending on the complexity of your scraper.
For regression testing, you should absolutely keep files on disk.
But if your ultimate goal is to scrape the web, you should also keep a record of common queries and the HTML that comes back. This way, when your application fails, you can quickly capture all past queries of interest (using, say, wget or curl) and find out if and how the HTML has changed. In other words, regression test both against known HTML and against unknown HTML from known queries. If you issue a known query and the HTML that comes back is identical to what's in your database, you don't need to test it twice.
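One way to decide which known queries need retesting is to compare a fingerprint of the freshly fetched HTML with the fingerprint of the stored copy. The sketch below assumes the "database" is just a dict and that the pages have already been fetched; fingerprint and pages_needing_retest are hypothetical names.

```python
import hashlib


def fingerprint(html):
    """Return a stable SHA-256 hex digest of an HTML string."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def pages_needing_retest(known, fetched):
    """Return the queries whose freshly fetched HTML differs from the stored copy.

    known:   dict mapping query -> fingerprint of the HTML saved earlier
    fetched: dict mapping query -> HTML just retrieved (e.g. via wget or curl)
    """
    changed = []
    for query, html in fetched.items():
        # A query that was never stored has no fingerprint, so it also counts.
        if known.get(query) != fingerprint(html):
            changed.append(query)
    return sorted(changed)
```

Only the queries returned here need to go through the scraper again; identical pages are already covered by the existing regression tests.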
Incidentally, I've had much better luck screen scraping ever since I stopped trying to scrape raw HTML and started instead to scrape the output of w3m -dump, which is ASCII and is so much easier to deal with!
You need to think about what it is you are scraping.
If the HTML is static, then I would just use a couple of different local copies on disk. Since you know the HTML is not bound to change drastically and break your scraper, you can confidently write your tests using a local file.
If the HTML is dynamic (again, loose term), then you may want to go ahead and use live requests in the test. If you use a local copy in this scenario and the test passes, you may expect the live HTML to do the same, whereas it may fail. In this case, by testing against the live HTML every time, you know before deployment whether your screen scraper is up to par.
Now if you simply don't care about the format of the HTML, the order of the elements, or the structure, because you are only pulling out individual elements with some matching mechanism (regex or other), then a local copy may be fine, but you may still want to lean towards testing against live HTML. If the live HTML changes, specifically the parts you are looking for, your test may pass with a local copy while the scraper fails once deployed.
My opinion would be to test against live HTML if you can. This will prevent your local tests from passing when the live HTML may fail, and vice versa. I don't think there is a best practice with screenscrapers, because screenscrapers in themselves are unusual little buggers. If a website or web service does not expose an API, a screenscraper is sort of a cheesy workaround for getting the data you want.
What you're suggesting sounds sensible. I'd perhaps have a directory of suitable test HTML files, plus data on what to expect for each one. You can further populate that with known problematic pages as/when you come across them, to form a complete regression test suite.
You should also perform integration tests for actually talking HTTP (including not just successful page fetches, but also 404 errors, unresponsive servers etc.)
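A rough sketch of the fixture-directory idea, assuming each test page name.html sits next to a name.json file holding the expected data. The layout, extract_title, and check_fixtures are all hypothetical; substitute your own scraper and expected fields.

```python
import json
import re
from pathlib import Path


def extract_title(html):
    """Hypothetical scraper under test: return the <title> text, or None."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def check_fixtures(fixture_dir):
    """Run the scraper on every *.html file and compare against its *.json twin.

    Returns a list of (file name, expected, actual) for every mismatch.
    """
    failures = []
    for html_path in sorted(Path(fixture_dir).glob("*.html")):
        expected_path = html_path.with_suffix(".json")
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        actual = {"title": extract_title(html_path.read_text(encoding="utf-8"))}
        if actual != expected:
            failures.append((html_path.name, expected, actual))
    return failures
```

Adding a newly discovered problematic page to the regression suite then only means dropping two files into the directory, with no new test code.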
I would say that depends on how many different tests you need to run.
If you need to check for a large number of different things in your unit test, you might be better off generating HTML output as part of your test initialization. It would still be file-based, but you would have an extensible pattern:
That way when you add test ZZZZZ down the road, you would have a consistent way of providing test data.
If you are just running a limited number of tests, and it will stay that way, a few pre-written static HTML files should be fine.
Certainly do some integration tests as Rich suggests.
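As one possible illustration of generating the HTML during test initialization (all names here are hypothetical, and this is not necessarily the pattern meant above), the test's setUp can write a generated page to a temporary file and the test can read it back:

```python
import re
import tempfile
import unittest
from pathlib import Path


def make_listing_page(items):
    """Hypothetical fixture builder: render (name, price) pairs as an HTML table."""
    rows = "".join(
        f'<tr><td class="name">{name}</td><td class="price">{price}</td></tr>'
        for name, price in items
    )
    return f"<html><body><table>{rows}</table></body></html>"


def scrape_names(html):
    """Hypothetical scraper under test: pull out every product name."""
    return re.findall(r'<td class="name">(.*?)</td>', html)


class GeneratedPageTest(unittest.TestCase):
    def setUp(self):
        # Generate the test page on disk so the test stays file-based.
        self.tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self.tmp.cleanup)
        self.page = Path(self.tmp.name) / "listing.html"
        self.page.write_text(
            make_listing_page([("Widget", "9.99"), ("Gadget", "19.50")]),
            encoding="utf-8",
        )

    def test_names_are_extracted_in_order(self):
        html = self.page.read_text(encoding="utf-8")
        self.assertEqual(scrape_names(html), ["Widget", "Gadget"])
```

A new test case then only needs a different list of items, rather than another hand-written HTML file.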
You're creating an external dependency, which is going to be fragile.
Why not create a TestContent project, populated with a bunch of resource files? Copy and paste your source HTML into the resource file(s), and then you can reference them in your unit tests.
Sounds like you have several components here:
You should test (and probably implement) these parts of the scraper independently.
There's no reason you shouldn't be able to get content from anywhere (i.e. no HTTP).
There's no reason you wouldn't want to strip away the chaff for purposes other than scraping.
There's no reason to only store data into your database via scraping.
So.. there's no reason to build and test all these pieces of your code as a single large program.
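A small sketch of what that separation could look like, assuming the pieces are roughly "get the content", "strip the chaff", and "store the result" as the points above suggest. The function names and the SQLite table are illustrative assumptions.

```python
import re
import sqlite3


def fetch(source):
    """Return raw HTML. `source` is any zero-argument callable, so a test can
    pass a lambda returning a string instead of doing HTTP."""
    return source()


def strip_chaff(html):
    """Drop tags and collapse whitespace, leaving only the text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())


def store(conn, text):
    """Insert one cleaned record; this step knows nothing about scraping."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (body TEXT)")
    conn.execute("INSERT INTO pages (body) VALUES (?)", (text,))
    conn.commit()
```

Each function can be tested alone: strip_chaff with plain strings, store with an in-memory database such as sqlite3.connect(":memory:"), and fetch with a stub in place of the network.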
Then again... maybe we're overcomplicating things?
You should probably query a static page on disk for all but one or two tests. But don't forget those tests that touch the web!
I don't see why it matters where the html originates from as far as your unit tests are concerned.
To clarify: your unit test is processing the HTML content, and where that content comes from is immaterial, so reading it from a file is fine for your unit tests. As you say in your comment, you certainly don't want to hit the network for every test, as that is just overhead.
You might also want to add an integration test or two to check that you're processing URLs correctly, though (i.e. that you are able to connect to and process external URLs).
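For the integration side, one hedged option is to exercise the real HTTP code path against a throwaway local server instead of the live site, which also makes error cases such as a 404 easy to reproduce. FakeSite, fetch_html, and run_fake_site are hypothetical names; the proxy-free opener is only there so the local request is not routed through a configured proxy.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeSite(BaseHTTPRequestHandler):
    """Stand-in web server: serves one page and answers 404 for everything else."""

    def do_GET(self):
        if self.path == "/page":
            body = b"<html><title>ok</title></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep test output quiet


def fetch_html(url, opener=None):
    """Hypothetical fetch step: return the page body, or None on an HTTP error."""
    opener = opener or urllib.request.build_opener()
    try:
        with opener.open(url, timeout=5) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError:
        return None


def run_fake_site():
    """Start FakeSite on a free local port; return (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), FakeSite)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"


# An opener with an empty ProxyHandler bypasses any proxy settings.
NO_PROXY_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))
```

A test would call run_fake_site(), fetch base_url + "/page" and a missing path through fetch_html, and shut the server down afterwards. A test or two against the real site can still be kept separately to catch changes on the live pages.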