The ROUGE scores are different when using the "datasets" and "rouge_score" packages
I use two packages, "datasets" and "rouge_score", to get the ROUGE-1 scores. However, the precision and recall differ between them. I wonder which package produces the correct scores?
from rouge_score import rouge_scorer
import datasets

hyp = ['I have no car.']
ref = ['I want to buy a car.']

scorer1 = datasets.load_metric('rouge')
scorer2 = rouge_scorer.RougeScorer(['rouge1'])

results = {'precision_rouge_score': [], 'recall_rouge_score': [], 'fmeasure_rouge_score': [],
           'precision_datasets': [], 'recall_datasets': [], 'fmeasure_datasets': []}

for (h, r) in zip(hyp, ref):
    # Score with the rouge_score package
    precision, recall, fmeasure = scorer2.score(h, r)['rouge1']
    results['precision_rouge_score'].append(precision)
    results['recall_rouge_score'].append(recall)
    results['fmeasure_rouge_score'].append(fmeasure)

    # Score with the datasets package
    output = scorer1.compute(predictions=[h], references=[r])
    results['precision_datasets'].append(output['rouge1'].mid.precision)
    results['recall_datasets'].append(output['rouge1'].mid.recall)
    results['fmeasure_datasets'].append(output['rouge1'].mid.fmeasure)

print('results: ', results)
The results are:
{'precision_rouge_score': [0.3333333333333333], 'recall_rouge_score': [0.5],
'fmeasure_rouge_score': [0.4],
'precision_datasets': [0.5], 'recall_datasets': [0.3333333333333333],
'fmeasure_datasets': [0.4]}
Comments (1)
According to the original paper, https://aclanthology.org/W04-1013.pdf, I saw this formula (the ROUGE-N recall):

ROUGE-N = (sum over reference n-grams of Count_match(gram_n)) / (sum over reference n-grams of Count(gram_n))
So for the two sentences above (Hyp: "I have no car." vs. Ref: "I want to buy a car."), ROUGE-1 recall = 2 (I, car) / 6 (I, want, to, buy, a, car) = 0.333333. It seems the "datasets" package is correct.
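That arithmetic can be checked with a small hand computation of unigram overlap. This is a minimal sketch, not part of either package; the `simple_rouge1` helper below is hypothetical and only illustrates the precision/recall definitions:

```python
import re
from collections import Counter

def simple_rouge1(reference, hypothesis):
    """ROUGE-1 precision/recall from clipped unigram overlap counts."""
    ref_tokens = re.findall(r'\w+', reference.lower())
    hyp_tokens = re.findall(r'\w+', hypothesis.lower())
    ref_counts = Counter(ref_tokens)
    hyp_counts = Counter(hyp_tokens)
    # Overlapping unigrams, clipped to how often each appears in the reference
    overlap = sum(min(ref_counts[t], hyp_counts[t]) for t in hyp_counts)
    precision = overlap / len(hyp_tokens)  # matches / unigrams in hypothesis
    recall = overlap / len(ref_tokens)     # matches / unigrams in reference
    return precision, recall

p, r = simple_rouge1('I want to buy a car.', 'I have no car.')
print(p, r)  # → 0.5 0.3333333333333333
```

The overlap here is 2 ("I", "car"); the hypothesis has 4 unigrams and the reference has 6, giving precision 2/4 = 0.5 and recall 2/6 ≈ 0.333, which matches the "datasets" output above.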