Computing precision and recall in Named Entity Recognition

Posted on 2024-08-12 17:29:07


Now I am about to report the results from Named Entity Recognition. One thing that I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives and false negatives over all classes.

But this seems implausible now that I think of it, as each misclassification would simultaneously give rise to one false positive and one false negative (e.g. a token that should have been labelled as "A" but was labelled as "B" is a false negative for "A" and a false positive for "B"). Thus the number of false positives and false negatives over all classes would be the same, which means that precision is (always!) equal to recall. This simply can't be true, so there is an error in my reasoning and I wonder where it is. It is certainly something quite obvious and straightforward, but it escapes me right now.


Comments (7)

疧_╮線 2024-08-19 17:29:08


Just to be clear, these are the definitions:

Precision = TP/(TP+FP) = What portion of what you found was ground truth?

Recall = TP/(TP+FN) = What portion of the ground truth did you recover?

They won't necessarily always be equal, since the number of false negatives will not necessarily equal the number of false positives.

If I understand your problem right, you're assigning each token to one of more than two possible labels. For precision and recall to make sense, you need a binary classifier. So you could use precision and recall by phrasing the classifier as whether a token is in group "A" or not, and then repeating for each group. In this case a missed classification is counted twice: as a false negative for one group and as a false positive for another.
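
As a concrete illustration of that one-vs-rest view, here is a minimal sketch (my own, assuming token-level lists of gold and predicted labels with made-up label names) that computes precision and recall separately for each group:

```python
from collections import Counter

def per_class_precision_recall(gold_labels, pred_labels):
    """Treat each class as its own binary 'is this token in group X?' problem."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_labels, pred_labels):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # predicted X, but the token wasn't X
            fn[gold] += 1   # the token really was X, but it was missed
    scores = {}
    for label in set(gold_labels) | set(pred_labels):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores[label] = (p, r)
    return scores

# Hypothetical token-level labels
gold = ["A", "A", "B", "O", "B"]
pred = ["A", "B", "B", "O", "O"]
print(per_class_precision_recall(gold, pred))
```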

If you're doing a classification like this where it isn't binary (assigning each token to a group), it might be useful instead to look at pairs of tokens. Phrase your problem as "Are tokens X and Y in the same classification group?". This allows you to compute precision and recall over all pairs of tokens. This isn't as appropriate if your classification groups are labeled or have associated meanings. For example, if your classification groups are "Fruits" and "Vegetables", and you classify both "Apples" and "Oranges" as "Vegetables", then this algorithm would score it as a true positive even though the wrong group was assigned. But if your groups are unlabeled, for example "A" and "B", then if apples and oranges were both classified as "A", afterward you could say that "A" corresponds to "Fruits".
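
And a minimal sketch of that pairwise formulation (again my own illustration, not code from the answer):

```python
from itertools import combinations

def pairwise_precision_recall(gold_labels, pred_labels):
    """Score 'are tokens i and j in the same group?' over all token pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold_labels)), 2):
        same_gold = gold_labels[i] == gold_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1   # predicted same group, but they aren't
        elif same_gold:
            fn += 1   # really same group, but predicted apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. pairwise_precision_recall(["A", "A", "B"], ["A", "B", "B"])
```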

抹茶夏天i‖ 2024-08-19 17:29:08


Please find the most recently updated answer (spaCy 3.x) that I was able to apply for scoring my NER model.

https://stackoverflow.com/a/75179670/21976126

It is very important to take note of the data structure this person uses, as the spaCy scorer will accept many forms of dictionaries without throwing an error and then score your model incorrectly.
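
I have not reproduced the linked answer here, but for orientation, a minimal sketch of spaCy 3.x-style evaluation (using spacy.training.Example and nlp.evaluate; the model name and offsets below are placeholders, so check the linked answer for the exact data structure it expects) might look like:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")  # or your own trained pipeline

# Gold data as (text, {"entities": [(start, end, label), ...]}) pairs
data = [
    ("I like London and Berlin.",
     {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

# Each Example pairs a doc to be predicted on with its gold annotations
examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in data]

scores = nlp.evaluate(examples)  # runs the pipeline and scores the predictions
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
```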

时光暖心i 2024-08-19 17:29:08


If you are training a spaCy NER model, its scorer.py API gives you the precision, recall and F-score of your NER.

For those with the same question, the relevant code lives in the following file:

spaCy/scorer.py

The code and output look like this (spaCy 2.x API; spacy.gold.GoldParse was removed in spaCy 3.x):

```python
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer


def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        # Build the gold-standard doc from the raw text and its entity offsets
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        # Run the model and score its predictions against the gold annotations
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores


# Example run
examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained model use 'en_core_web_sm'
results = evaluate(ner_model, examples)
```

The output will be in a format like this, where ents_p, ents_r and ents_f are the entity-level precision, recall and F-score:

```
{'uas': 0.0, 'las': 0.0, 'ents_p': 43.75, 'ents_r': 35.59322033898305, 'ents_f': 39.252336448598136, 'tags_acc': 0.0, 'token_acc': 100.0}
```

垂暮老矣 2024-08-19 17:29:07


The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):

[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today

This has 3 entities.

Supposing your actual extraction has the following

[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.

We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.

Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)

Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33

Any Overlap OK: True Positives = 2 (Microsoft Corp., and Steve which overlaps Steve Ballmer), False Positives = 2 (CEO, and today), False Negatives = 1 (Windows 7)

Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.50
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.67

The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.

It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
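
To make the two matching criteria concrete, a minimal sketch (my own illustration, ignoring entity types and representing each entity as a (start, end) character span, with made-up offsets) could look like this:

```python
def score_spans(gold_spans, pred_spans, exact=True):
    """Precision/recall/F1 over entity spans, with exact or any-overlap matching."""
    def match(g, p):
        return g == p if exact else (g[0] < p[1] and p[0] < g[1])

    tp = sum(any(match(g, p) for g in gold_spans) for p in pred_spans)
    fp = len(pred_spans) - tp
    fn = sum(not any(match(g, p) for p in pred_spans) for g in gold_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical character spans mirroring the example above
gold = [(0, 15), (20, 33), (60, 69)]            # Microsoft Corp., Steve Ballmer, Windows 7
pred = [(0, 15), (16, 19), (20, 25), (70, 75)]  # Microsoft Corp., CEO, Steve, today

print(score_spans(gold, pred, exact=True))   # approx. (0.25, 0.33, 0.29)
print(score_spans(gold, pred, exact=False))  # approx. (0.50, 0.67, 0.57)
```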

-残月青衣踏尘吟 2024-08-19 17:29:07


In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach to evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.

However, the problem you described still exists. If you mark an entity of type ORG with type LOC you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion on the problem in this blog post.
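
If your data is in BIO-tagged form, a widely used off-the-shelf implementation of this entity-level, CoNLL-style scoring is the seqeval package; this is my suggestion rather than something from the answer, and the tags below are made up:

```python
# pip install seqeval
from seqeval.metrics import classification_report, f1_score

# One inner list of BIO tags per sentence, aligned token by token
y_true = [["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-ORG", "I-ORG", "O", "B-PER", "O",     "O", "B-ORG"]]

print(classification_report(y_true, y_pred))  # entity-level precision/recall/F1 per type
print(f1_score(y_true, y_pred))               # micro-averaged entity-level F1
```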

咿呀咿呀哟 2024-08-19 17:29:07


As mentioned before, there are different ways of measuring NER performance. It is possible to evaluate separately how precisely entities are detected in terms of position in the text, and in terms of their class (person, location, organization, etc.). Or to combine both aspects in a single measure.

You'll find a nice review in the following thesis: D. Nadeau, Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision (2007). Have a look at section 2.6. Evaluation of NER.

烧了回忆取暖 2024-08-19 17:29:07


There is no simple right answer to this question. There are a variety of different ways to count errors. The MUC competitions used one, other people have used others.

However, to help you with your immediate confusion:

You have a set of tags, no? Something like NONE, PERSON, ANIMAL, VEGETABLE?

If a token should be person, and you tag it NONE, then that's a false positive for NONE and a false negative for PERSON. If a token should be NONE and you tag it PERSON, it's the other way around.

So you get a score for each entity type.

You can also aggregate those scores.
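
As a small sketch of what aggregating those per-tag scores can look like (the counts below are made up). Note that if you micro-average over every tag including NONE, total false positives equal total false negatives and precision equals recall, which is exactly the symmetry the question describes; excluding NONE, or macro-averaging over the entity types, is what breaks it:

```python
# Hypothetical per-tag counts of true positives, false positives and false negatives
counts = {
    "PERSON":    {"tp": 80, "fp": 10, "fn": 20},
    "ANIMAL":    {"tp": 40, "fp": 25, "fn": 5},
    "VEGETABLE": {"tp": 30, "fp": 5,  "fn": 15},
}

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Per-tag scores
per_tag = {t: prf(**c) for t, c in counts.items()}

# Micro average: pool the counts over tags, then compute precision/recall once
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro = prf(tp, fp, fn)

# Macro average: compute precision/recall per tag, then average the scores
macro = (sum(p for p, _ in per_tag.values()) / len(per_tag),
         sum(r for _, r in per_tag.values()) / len(per_tag))

print(per_tag, micro, macro)
```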
