BLEU score implementation for sentence similarity detection

Posted 2024-10-24 08:06:41

I need to calculate a BLEU score to identify whether two sentences are similar or not. I have read some articles, but they are mostly about using BLEU to measure machine translation accuracy. However, I need a BLEU score to find the similarity between sentences in the same language [English] (i.e. both sentences are in English). Thanks in anticipation.

Answers (6)

剑心龙吟 2024-10-31 08:06:42

For sentence level comparisons, use smoothed BLEU

The standard BLEU score used for machine translation evaluation (BLEU:4) is only really meaningful at the corpus level, since any sentence that does not have at least one 4-gram match will be given a score of 0.

This happens because, at its core, BLEU is really just the geometric mean of n-gram precisions that is scaled by a brevity penalty to prevent very short sentences with some matching material from being given inappropriately high scores. Since the geometric mean is calculated by multiplying together all the terms to be included in the mean, having a zero for any of the n-gram counts results in the entire score being zero.

If you want to apply BLEU to individual sentences, you're better off using smoothed BLEU (Lin and Och 2004 - see sec. 4), whereby you add 1 to each of the n-gram counts before you calculate the n-gram precisions. This will prevent any of the n-gram precisions from being zero, and thus will result in non-zero values even when there are not any 4-gram matches.
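
As an illustration, here is a minimal sketch of sentence-level BLEU with and without smoothing, using NLTK's sentence_bleu and SmoothingFunction (the example sentences are invented; NLTK's method2 roughly corresponds to the add-one smoothing described above):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

# Standard BLEU:4 -- effectively zero here, because the two sentences share no 4-gram.
plain = sentence_bleu([reference], candidate)

# Smoothed BLEU: method2 adds 1 to the higher-order n-gram counts,
# so the score stays non-zero even without any 4-gram match.
smoothed = sentence_bleu([reference], candidate,
                         smoothing_function=SmoothingFunction().method2)

print(plain, smoothed)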

Java Implementation

You'll find a Java implementation of both BLEU and smooth BLEU in the Stanford machine translation package Phrasal.

Alternatives

As Andreas already mentioned, you might want to use an alternative scoring metric such as the Levenshtein string edit distance. However, one problem with using the traditional Levenshtein string edit distance to compare sentences is that it isn't explicitly aware of word boundaries.

Other alternatives include:

  • Word Error Rate - This is essentially the Levenshtein distance applied to a sequence of words rather than a sequence of characters. It's widely used for scoring speech recognition systems (a minimal sketch follows after this list).
  • Translation Edit Rate (TER) - This is similar to word error rate, but it allows for an additional swap edit operation for adjacent words and phrases. This metric has become popular within the machine translation community since it correlates better with human judgments than other sentence similarity measures such as BLEU. The most recent variant of this metric, known as Translation Edit Rate Plus (TERp), allows for matching of synonyms using WordNet as well as paraphrases of multiword sequences ("died" ~= "kicked the bucket").
  • METEOR - This metric first calculates an alignment that allows for arbitrary reordering of the words in the two sentences being compared. If there are multiple possible ways to align the sentences, METEOR selects the one that minimizes crisscrossing alignment edges. Like TERp, METEOR allows for matching of WordNet synonyms and paraphrases of multiword sequences. After alignment, the metric computes the similarity between the two sentences using the number of matching words to calculate an F-α score, a balanced measure of precision and recall, which is then scaled by a penalty for the amount of word order scrambling present in the alignment.
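
To make the word error rate idea concrete, here is a minimal sketch of Levenshtein distance computed over word sequences (the helper name and example sentences are just for illustration):

def word_edit_distance(a, b):
    a, b = a.split(), b.split()
    # dp[i][j] = edits needed to turn the first i words of a into the first j words of b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1]

# Word error rate normalises the edit count by the reference length.
ref = "the cat is on the mat"
hyp = "the cat sat on the mat"
print(word_edit_distance(ref, hyp) / len(ref.split()))  # 1 edit / 6 words
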
清风挽心 2024-10-31 08:06:42

Well, if you just want to calculate the BLEU score, it's straightforward. Treat one sentence as the reference translation and the other as the candidate translation.

如果没结果 2024-10-31 08:06:42

Maybe the (Levenshtein) edit distance is also an option, or the Hamming distance. Either way, the BLEU score is also suited to the job; it measures the similarity of one sentence against a reference, which only makes sense when they are in the same language, as in your problem.

伊面 2024-10-31 08:06:42

You can use the Moses multi-bleu script, which also supports multiple references: https://github.com/moses-smt/mosesdecoder/blob/RELEASE-2.1.1/scripts/generic/multi-bleu.perl

无所谓啦 2024-10-31 08:06:42

You are not encouraged to implement BLEU yourself; SacreBLEU is a standard implementation.

# Load the SacreBLEU metric through the Hugging Face datasets library
from datasets import load_metric
metric = load_metric("sacrebleu")
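
For completeness, a hedged usage sketch with invented example strings (this follows the datasets metric API used above; newer releases expose the same metric through the separate evaluate package):

predictions = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # each prediction may have several references
result = metric.compute(predictions=predictions, references=references)
print(result["score"])  # BLEU on a 0-100 scale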