Testing a classifier with examples

Posted on 2024-12-11 18:43:45

I am writing a classifier that categorizes whether a special deal is for a restaurant, a hotel, etc. This is part of a web crawler for analyzing external sites.
To start, I wrote a meal?() method, which accepts a piece of text and returns true if it thinks the text is about a meal deal. It can't be 100% accurate, since it uses only simple keyword matching.

def meal?(text)
  # "..." stands for the rest of the keyword list, elided in the post
  text.match?(/restaurant|meal|wine|.../i)
end

Now I am writing a test for it, and I have two questions. The first is that I think it is a bit redundant to re-list all of these keywords in the unit test. What do you think?
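
One possible way around re-listing the keywords, sketched under assumptions: assert on a few representative sentences, including at least one negative, instead of on the keyword list itself. The sketch uses Minitest and only the three keywords visible above (the elided ones are not guessed at); the sample sentences are made up for illustration.

```ruby
require "minitest/autorun"

# Only the keywords visible in the question; the elided rest is omitted here.
MEAL_KEYWORDS = /restaurant|meal|wine/i

def meal?(text)
  text.match?(MEAL_KEYWORDS)
end

class MealClassifierTest < Minitest::Test
  # Hypothetical sample sentences, not taken from any real site.
  def test_recognizes_meal_deals
    assert meal?("Three-course MEAL for two, 40% off")
    assert meal?("Free bottle of wine with every booking")
  end

  def test_rejects_unrelated_deals
    refute meal?("Two nights in a seaside hotel")
  end
end
```

With this shape, the test documents the intended behavior through examples, and adding a keyword only requires a new test line when that keyword covers a case worth pinning down.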

The second question:
I have an .html file in source control that is used to test the crawler's parsing functionality. In theory, all of its items should pass, so I am thinking of using that HTML in this categorizing test: parse it and feed the description of each deal into this method.

One drawback is that the .html is taken from an external site. When that site changes its layout, I will update this .html file, and then I will have to change this categorizing test too. But I think this is okay.

Is this recommended? I thought of this approach because I feel uneasy about extracting information from that .html and placing it in the test script itself (it is not DRY, and it makes the test script quite big). Would feeding in the parsed descriptions violate any fundamental testing principles, like 'this hides the necessary details from developers' or 'this is bad for generating reports'?
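
For reference, the fixture-driven test described above might look roughly like the following sketch. Everything named here is an assumption for illustration: FIXTURE_HTML stands in for the saved .html file (a real test would read it from source control), deal_descriptions stands in for the crawler's actual parser, and the pattern contains only the keywords visible in the question.

```ruby
require "minitest/autorun"

# Only the keywords visible in the question; the elided rest is omitted here.
def meal?(text)
  text.match?(/restaurant|meal|wine/i)
end

# Hypothetical stand-in for the saved .html fixture; a real test would
# File.read the file kept in source control instead.
FIXTURE_HTML = <<~HTML
  <ul>
    <li class="deal">Dinner for two at a riverside restaurant</li>
    <li class="deal">Wine tasting with a three-course meal</li>
  </ul>
HTML

# Hypothetical stand-in for the crawler's parser: returns each deal description.
def deal_descriptions(html)
  html.scan(%r{<li class="deal">(.*?)</li>}m).flatten
end

class MealFixtureTest < Minitest::Test
  def test_every_fixture_deal_is_a_meal_deal
    descriptions = deal_descriptions(FIXTURE_HTML)
    # Guard against a vacuous pass when the layout changes and nothing is parsed.
    refute_empty descriptions, "parser returned nothing; the fixture layout may have changed"

    descriptions.each do |description|
      # Put the text in the failure message so reports show which deal failed.
      assert meal?(description), "not classified as a meal deal: #{description.inspect}"
    end
  end
end
```

Including each description in the failure message is one way to ease the concern about hidden details: a failing run prints the offending text even though the test file itself does not contain it.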

Comments (1)

少女情怀诗 2024-12-18 18:43:45

OK, so I obviously misunderstood the question, so I will revise this answer completely.

I personally think it is simpler and preferable to take the actual text from the HTML file and copy/paste it into the test, as opposed to the indirection of loading an HTML file. I can find two reasons:

  • When I write or read unit tests, I prefer all the info to be right in front of me rather than in an 'external source' like a resource file that I have to dig for. That is a personal preference, though.
  • It is a bit confusing, because you can use this method for other things as well, not just for classifying text read from an HTML file. So to keep the test more generic, I would just use raw text in the actual test.

However, I cannot find a reason why what you are trying to do is really bad; I think it boils down to personal preference.
