Testing a classifier with examples

Posted on 2024-12-11 18:43:45

I am writing a classifier that categorizes whether a special deal is for a restaurant, a hotel, and so on. It is part of a web crawler for analyzing external sites.
To start, I wrote a meal?() method, which accepts a piece of text and returns true if it thinks the text is about a meal deal. It can't be 100% accurate, since it uses only simple keyword matching.

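# Returns true when the text contains one of the meal-related keywords.
# The "..." stands for the rest of the keyword list, elided here.
def meal?(text)
  text.match?(/restaurant|meal|wine|.../i)
end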

Now I am writing a test for it, and I have two questions. First, it seems a bit redundant to re-list all of these keywords in the unit test. What do you think?
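
One possible way around the duplication, sketched under assumptions: keep the keywords in a single constant and have the test check a few representative sample texts instead of repeating every keyword. The names MEAL_KEYWORDS, MEAL_PATTERN, MEAL_SAMPLES and OTHER_SAMPLES, the sample sentences, and the three-keyword list are illustrative only; the real list is longer.

```ruby
# Hypothetical constant holding only the three keywords shown in the post.
MEAL_KEYWORDS = %w[restaurant meal wine].freeze

# Build one case-insensitive pattern from the list (Regexp.union escapes each string).
MEAL_PATTERN = Regexp.new(Regexp.union(MEAL_KEYWORDS).source, Regexp::IGNORECASE)

def meal?(text)
  text.match?(MEAL_PATTERN)
end

# Representative texts rather than a copy of the keyword list.
MEAL_SAMPLES  = ["Three-course meal for two", "50% off at a seafood RESTAURANT"].freeze
OTHER_SAMPLES = ["Two nights in a seaside hotel", "Spa day with massage"].freeze

# Returns the samples the classifier gets wrong, so a failure can name them.
def wrong_samples
  MEAL_SAMPLES.reject { |t| meal?(t) } + OTHER_SAMPLES.select { |t| meal?(t) }
end
```

A test would then only need to assert that wrong_samples is empty, and adding a keyword would not force a matching edit in the test.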

The second question:
I have an .html file in source control. It is used to test the crawler's parsing functionality. In theory all of its items should pass, so I am thinking of using that .html in this categorizing test: parse it and feed the description of each deal into this method.
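
The idea could be sketched roughly as follows. This is only an illustration: the inline FIXTURE_HTML stands in for the .html file in source control, its markup and the "description" class are made up, parse_descriptions is a stand-in for the crawler's real parser, and the keyword list is shortened.

```ruby
# Stand-in for the fixture file kept in source control (markup is hypothetical).
FIXTURE_HTML = <<~HTML
  <ul>
    <li class="deal"><p class="description">Dinner at an Italian restaurant</p></li>
    <li class="deal"><p class="description">Wine tasting for two</p></li>
  </ul>
HTML

# Stand-in for the crawler's parser: extracts the text of each description.
def parse_descriptions(html)
  html.scan(%r{<p class="description">(.*?)</p>}m).flatten.map(&:strip)
end

def meal?(text)
  text.match?(/restaurant|meal|wine/i) # keyword list shortened for this sketch
end

# Returns the descriptions the classifier rejects, so a failing test
# can report exactly which texts were misclassified.
def misclassified(html)
  parse_descriptions(html).reject { |description| meal?(description) }
end
```

Asserting that misclassified(FIXTURE_HTML) is empty keeps the test short, and a failure still shows the offending descriptions rather than hiding them behind the fixture.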

One drawback is that the .html is taken from an external site. When that site changes its layout I will update this .html file, and then I will have to change this categorizing test too. But I think this is okay.

Is this recommended? I thought of this approach because I feel uneasy about extracting information from that .html and placing it in the test script itself (it is not DRY, and it makes the test script quite big). Would feeding in the parsed descriptions violate any fundamental testing laws, like 'this hides the necessary details from developers' or 'this is bad for generating reports'?
