How to calculate tag-wise precision and recall for a POS tagger?
I am using some rule-based and statistical POS taggers to tag a corpus (of around 5000 sentences) with parts of speech (POS). The following is a snippet of my test corpus, where each word is separated from its respective POS tag by '/'.
No/RB ,/, it/PRP was/VBD n't/RB Black/NNP Monday/NNP ./.
But/CC while/IN the/DT New/NNP York/NNP Stock/NNP Exchange/NNP did/VBD n't/RB fall/VB apart/RB Friday/NNP as/IN the/DT Dow/NNP Jones/NNP Industrial/NNP Average/NNP plunged/VBD 190.58/CD points/NNS --/: most/JJS of/IN it/PRP in/IN the/DT final/JJ hour/NN --/: it/PRP barely/RB managed/VBD *-2/-NONE- to/TO stay/VB this/DT side/NN of/IN chaos/NN ./.
Some/DT ``/`` circuit/NN breakers/NNS ''/'' installed/VBN */-NONE- after/IN the/DT October/NNP 1987/CD crash/NN failed/VBD their/PRP$ first/JJ test/NN ,/, traders/NNS say/VBP 0/-NONE- *T*-1/-NONE- ,/, *-2/-NONE- unable/JJ *-3/-NONE- to/TO cool/VB the/DT selling/NN panic/NN in/IN both/DT stocks/NNS and/CC futures/NNS ./.
After tagging the corpus, it looks like this:
No/DT ,/, it/PRP was/VBD n't/RB Black/NNP Monday/NNP ./.
But/CC while/IN the/DT New/NNP York/NNP Stock/NNP Exchange/NNP did/VBD n't/RB fall/VB apart/RB Friday/VB as/IN the/DT Dow/NNP Jones/NNP Industrial/NNP Average/JJ plunged/VBN 190.58/CD points/NNS --/: most/RBS of/IN it/PRP in/IN the/DT final/JJ hour/NN --/: it/PRP barely/RB managed/VBD *-2/-NONE- to/TO stay/VB this/DT side/NN of/IN chaos/NNS ./.
Some/DT ``/`` circuit/NN breakers/NNS ''/'' installed/VBN */-NONE- after/IN the/DT October/NNP 1987/CD crash/NN failed/VBD their/PRP$ first/JJ test/NN ,/, traders/NNS say/VB 0/-NONE- *T*-1/-NONE- ,/, *-2/-NONE- unable/JJ *-3/-NONE- to/TO cool/VB the/DT selling/VBG panic/NN in/IN both/DT stocks/NNS and/CC futures/NNS ./.
I need to calculate the tagging accuracy (tag-wise recall and precision), and therefore need to find the error (if any) in the tagging of each word-tag pair.
The approach I am thinking of is to loop through these two text files, store their contents in lists, and then compare the two lists element by element. This approach seems really crude to me, so I would like you to suggest a better solution to the above problem.
From the Wikipedia page:
In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).
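Writing TP(t), FP(t) and FN(t) for the true-positive, false-positive and false-negative counts of a tag t, that excerpt boils down to:

```latex
\mathrm{precision}(t) = \frac{TP(t)}{TP(t) + FP(t)}, \qquad
\mathrm{recall}(t) = \frac{TP(t)}{TP(t) + FN(t)}
```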
Answer:
Note that since every word has exactly one tag, overall recall and precision scores are meaningless for this task (they'll both just equal the accuracy measure). But it does make sense to ask for recall and precision measures per tag - for example, you could find the recall and precision for the DT tag.
The most efficient way to do this for all tags at once is similar to the way you suggested, though you can save one pass over the data by skipping the list-making stage. Read in a line of each file, compare the two lines word by word, and repeat until you reach the end of the files. For each word comparison, you probably want to check the words are equal for sanity, rather than assuming the two files are in sync. For each kind of tag, you keep three running totals: true positives, false positives and false negatives. If the two tags for the current word match, increment the true positive total for the tag. If they don't match, you need to increment the false negative total for the true tag and the false positive total for the tag your machine mistakenly chose. At the end, you can calculate recall and precision scores for each tag by following the formula in your Wikipedia excerpt.
I haven't tested this code and my Python's a bit rusty, but this should give you the idea. I'm assuming the files are open and the totals data structure is a dictionary of dictionaries:
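A minimal sketch of the loop described above. The file names gold.pos and tagged.pos, the handle names gold_file and test_file, and the use of collections.defaultdict for the nested totals are my own assumptions, not part of the original answer:

```python
from collections import defaultdict

# hypothetical file names: gold.pos holds the hand-tagged corpus,
# tagged.pos the tagger's output, one sentence per line in word/TAG format
gold_file = open('gold.pos')
test_file = open('tagged.pos')

# totals[tag] -> running counts of true positives, false positives
# and false negatives for that tag
totals = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})

for gold_line, test_line in zip(gold_file, test_file):
    # rsplit on the last '/' so words that themselves contain a slash
    # still pair up correctly with their tag
    gold_pairs = [tok.rsplit('/', 1) for tok in gold_line.split()]
    test_pairs = [tok.rsplit('/', 1) for tok in test_line.split()]
    for (gold_word, gold_tag), (test_word, test_tag) in zip(gold_pairs, test_pairs):
        # sanity check that the two files really are in sync
        assert gold_word == test_word, f'files out of sync: {gold_word} vs {test_word}'
        if gold_tag == test_tag:
            totals[gold_tag]['tp'] += 1
        else:
            totals[gold_tag]['fn'] += 1  # the true tag was missed
            totals[test_tag]['fp'] += 1  # the wrong tag was chosen

# apply the formulas from the Wikipedia excerpt, tag by tag
for tag, c in sorted(totals.items()):
    precision = c['tp'] / (c['tp'] + c['fp']) if c['tp'] + c['fp'] else 0.0
    recall = c['tp'] / (c['tp'] + c['fn']) if c['tp'] + c['fn'] else 0.0
    print(f'{tag:10s} precision={precision:.3f} recall={recall:.3f}')
```

Note that on a mismatch the counter updates touch two different tags: the gold tag gains a false negative while the predicted tag gains a false positive, which is exactly why per-tag precision and recall can differ even though overall they both collapse to plain accuracy.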