Computing precision and recall in Named Entity Recognition

Posted on 2024-08-12 17:29:07


Now I am about to report the results from Named Entity Recognition. One thing that I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives and false negatives over all classes.

But this seems implausible now that I think of it, as each misclassification would simultaneously give rise to one false positive and one false negative (e.g. a token that should have been labelled as "A" but was labelled as "B" is a false negative for "A" and a false positive for "B"). Thus the number of false positives and false negatives over all classes would be the same, which means that precision is (always!) equal to recall. This simply can't be true, so there is an error in my reasoning and I wonder where it is. It is certainly something quite obvious and straightforward, but it escapes me right now.


Comments (7)

疧_╮線 2024-08-19 17:29:08


Just to be clear, these are the definitions:

Precision = TP/(TP+FP) = What portion of what you found was ground truth?

Recall = TP/(TP+FN) = What portion of the ground truth did you recover?

They won't necessarily always be equal, since the number of false negatives will not necessarily equal the number of false positives.

If I understand your problem right, you're assigning each token to one of more than two possible labels. For precision and recall to make sense, you need a binary classifier. So you could use precision and recall by phrasing the classifier as whether a token is in group "A" or not, and then repeating for each group. In this case a missed classification is counted twice: as a false negative for one group and as a false positive for another.
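
As a concrete illustration of that one-vs-rest view, here is a minimal sketch (my own, assuming token-level lists of gold and predicted labels with made-up label names) that computes precision and recall separately for each group:

```python
from collections import Counter

def per_class_precision_recall(gold_labels, pred_labels):
    """Treat each class as its own binary 'is this token in group X?' problem."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_labels, pred_labels):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # predicted X, but the token wasn't X
            fn[gold] += 1   # the token really was X, but it was missed
    scores = {}
    for label in set(gold_labels) | set(pred_labels):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores[label] = (p, r)
    return scores

# Hypothetical token-level labels
gold = ["A", "A", "B", "O", "B"]
pred = ["A", "B", "B", "O", "O"]
print(per_class_precision_recall(gold, pred))
```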

If you're doing a classification like this where it isn't binary (assigning each token to a group), it might be useful instead to look at pairs of tokens. Phrase your problem as "Are tokens X and Y in the same classification group?". This allows you to compute precision and recall over all pairs of tokens. This isn't as appropriate if your classification groups are labeled or have associated meanings. For example, if your classification groups are "Fruits" and "Vegetables", and you classify both "Apples" and "Oranges" as "Vegetables", then this algorithm would score it as a true positive even though the wrong group was assigned. But if your groups are unlabeled, for example "A" and "B", then if apples and oranges were both classified as "A", afterward you could say that "A" corresponds to "Fruits".
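
And a minimal sketch of that pairwise formulation (again my own illustration, not code from the answer):

```python
from itertools import combinations

def pairwise_precision_recall(gold_labels, pred_labels):
    """Score 'are tokens i and j in the same group?' over all token pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold_labels)), 2):
        same_gold = gold_labels[i] == gold_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1   # predicted same group, but they aren't
        elif same_gold:
            fn += 1   # really same group, but predicted apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. pairwise_precision_recall(["A", "A", "B"], ["A", "B", "B"])
```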

抹茶夏天i‖ 2024-08-19 17:29:08


Please find the most recently updated answer (spaCy 3.x) that I was able to apply for scoring my NER model.

https://stackoverflow.com/a/75179670/21976126

It is very important to take note of the data structure this person uses, as the spaCy scorer will accept many forms of dictionaries without throwing an error and then score your model incorrectly.
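
I have not reproduced the linked answer here, but for orientation, a minimal sketch of spaCy 3.x-style evaluation (using spacy.training.Example and nlp.evaluate; the model name and offsets below are placeholders, so check the linked answer for the exact data structure it expects) might look like:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")  # or your own trained pipeline

# Gold data as (text, {"entities": [(start, end, label), ...]}) pairs
data = [
    ("I like London and Berlin.",
     {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

# Each Example pairs a doc to be predicted on with its gold annotations
examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in data]

scores = nlp.evaluate(examples)  # runs the pipeline and scores the predictions
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
```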

时光暖心i 2024-08-19 17:29:08


If you are training a spaCy NER model, its scorer.py API gives you the precision, recall and F-score of your NER.

For those with the same question, the relevant code lives in the following file:

spaCy/scorer.py

The code and output look like this (spaCy 2.x API; spacy.gold.GoldParse was removed in spaCy 3.x):

```python
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer


def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        # Build the gold-standard doc from the raw text and its entity offsets
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        # Run the model and score its predictions against the gold annotations
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores


# Example run
examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained model use 'en_core_web_sm'
results = evaluate(ner_model, examples)
```

The output will be in a format like this, where ents_p, ents_r and ents_f are the entity-level precision, recall and F-score:

```
{'uas': 0.0, 'las': 0.0, 'ents_p': 43.75, 'ents_r': 35.59322033898305, 'ents_f': 39.252336448598136, 'tags_acc': 0.0, 'token_acc': 100.0}
```

垂暮老矣 2024-08-19 17:29:07


The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):

[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today

This has 3 entities.

Supposing your actual extraction has the following

[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.

We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.

Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)

Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33

Any Overlap OK: True Positives = 2 (Microsoft Corp., and Steve which overlaps Steve Ballmer), False Positives = 2 (CEO, and today), False Negatives = 1 (Windows 7)

Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.50
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.67

The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.

It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
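
To make the two matching criteria concrete, a minimal sketch (my own illustration, ignoring entity types and representing each entity as a (start, end) character span, with made-up offsets) could look like this:

```python
def score_spans(gold_spans, pred_spans, exact=True):
    """Precision/recall/F1 over entity spans, with exact or any-overlap matching."""
    def match(g, p):
        return g == p if exact else (g[0] < p[1] and p[0] < g[1])

    tp = sum(any(match(g, p) for g in gold_spans) for p in pred_spans)
    fp = len(pred_spans) - tp
    fn = sum(not any(match(g, p) for p in pred_spans) for g in gold_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical character spans mirroring the example above
gold = [(0, 15), (20, 33), (60, 69)]            # Microsoft Corp., Steve Ballmer, Windows 7
pred = [(0, 15), (16, 19), (20, 25), (70, 75)]  # Microsoft Corp., CEO, Steve, today

print(score_spans(gold, pred, exact=True))   # approx. (0.25, 0.33, 0.29)
print(score_spans(gold, pred, exact=False))  # approx. (0.50, 0.67, 0.57)
```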

-残月青衣踏尘吟 2024-08-19 17:29:07


In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach to evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.

However, the problem you described still exists. If you mark an entity of type ORG with type LOC you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion on the problem in this blog post.
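
If your data is in BIO-tagged form, a widely used off-the-shelf implementation of this entity-level, CoNLL-style scoring is the seqeval package; this is my suggestion rather than something from the answer, and the tags below are made up:

```python
# pip install seqeval
from seqeval.metrics import classification_report, f1_score

# One inner list of BIO tags per sentence, aligned token by token
y_true = [["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-ORG", "I-ORG", "O", "B-PER", "O",     "O", "B-ORG"]]

print(classification_report(y_true, y_pred))  # entity-level precision/recall/F1 per type
print(f1_score(y_true, y_pred))               # micro-averaged entity-level F1
```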

咿呀咿呀哟 2024-08-19 17:29:07


As mentioned before, there are different ways of measuring NER performance. It is possible to evaluate separately how precisely entities are detected in terms of position in the text, and in terms of their class (person, location, organization, etc.). Or to combine both aspects in a single measure.

You'll find a nice review in the following thesis: D. Nadeau, Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision (2007). Have a look at section 2.6. Evaluation of NER.

烧了回忆取暖 2024-08-19 17:29:07


There is no simple right answer to this question. There are a variety of different ways to count errors. The MUC competitions used one, other people have used others.

However, to help you with your immediate confusion:

You have a set of tags, no? Something like NONE, PERSON, ANIMAL, VEGETABLE?

If a token should be person, and you tag it NONE, then that's a false positive for NONE and a false negative for PERSON. If a token should be NONE and you tag it PERSON, it's the other way around.

So you get a score for each entity type.

You can also aggregate those scores.
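
As a small sketch of what aggregating those per-tag scores can look like (the counts below are made up). Note that if you micro-average over every tag including NONE, total false positives equal total false negatives and precision equals recall, which is exactly the symmetry the question describes; excluding NONE, or macro-averaging over the entity types, is what breaks it:

```python
# Hypothetical per-tag counts of true positives, false positives and false negatives
counts = {
    "PERSON":    {"tp": 80, "fp": 10, "fn": 20},
    "ANIMAL":    {"tp": 40, "fp": 25, "fn": 5},
    "VEGETABLE": {"tp": 30, "fp": 5,  "fn": 15},
}

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Per-tag scores
per_tag = {t: prf(**c) for t, c in counts.items()}

# Micro average: pool the counts over tags, then compute precision/recall once
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro = prf(tp, fp, fn)

# Macro average: compute precision/recall per tag, then average the scores
macro = (sum(p for p, _ in per_tag.values()) / len(per_tag),
         sum(r for _, r in per_tag.values()) / len(per_tag))

print(per_tag, micro, macro)
```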
