Pseudocode for a script to check transcription accuracy / edit distance

Posted on 2024-12-10 03:51:12

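I need to write a script, probably in Ruby, that takes one block of text and compares a number of transcriptions of recordings of that text against the original to check for accuracy. If that's completely confusing, I'll try explaining it another way...

I have recordings of several different people reading a script that is a few sentences long. These recordings have all been transcribed back to text a number of times by other people. I need to take all of the transcriptions (hundreds of them) and compare them against the original script for accuracy.

I'm having trouble even conceptualizing the pseudocode, and I'm wondering if anyone can point me in the right direction. Is there an established algorithm I should be considering? The Levenshtein distance has been suggested to me, but it doesn't seem to cope well with longer strings, given differences in punctuation choice, whitespace, etc. Missing the first word would wreck the whole algorithm, even if every other word were perfect. I'm open to anything. Thank you!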

Edit:

Thanks for the tips, psyho. However, one of my biggest concerns is a situation like this:

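Original text:

I would've taken that course if I'd known it was available!

Transcription:

I would have taken that course if I'd known it was available!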

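Even with a word-wise comparison of tokens, this transcription would be marked as quite wrong, even though it's almost perfect, and this is hardly an edge case! "would've" and "would have" are commonly pronounced extremely similarly, especially in this part of the world. Is there a way to make the approach you suggest robust enough to deal with this? I've thought about doing a word-wise comparison both forwards and backwards and building a composite score, but that would fall apart with a transcription like this: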

I would have taken that course if I had known it was available!

Any ideas?


心病无药医 2024-12-17 03:51:12

Simple version:

Tokenize your input into words (convert a string containing words, punctuation, etc. into an array of lowercase words, without punctuation).
Use the Levenshtein distance (wordwise) to compare the original array with the transcription arrays.
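
The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the names `tokenize`, `levenshtein` and `word_error_rate` are made up for this example, and the tokenizer assumes plain ASCII English text with straight apostrophes.

```ruby
# Split text into lowercase word tokens, dropping punctuation.
# Apostrophes inside a word are kept, so "would've" stays one token.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+(?:'[a-z]+)*/)
end

# Classic dynamic-programming Levenshtein distance between two arrays.
# It works on arrays of words (or any elements comparable with ==).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_with_index do |x, i|
    curr = [i + 1]
    b.each_with_index do |y, j|
      cost = x == y ? 0 : 1
      # deletion, insertion, substitution (or match)
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Word-level edits needed per word of the original (0.0 means identical).
# Returns 0.0 for an empty original to avoid dividing by zero.
def word_error_rate(original, transcription)
  ref = tokenize(original)
  return 0.0 if ref.empty?
  levenshtein(ref, tokenize(transcription)).to_f / ref.length
end
```

Because the comparison is done on word tokens, a transcription that drops the first word costs a single deletion rather than throwing off everything after it: `word_error_rate("the quick brown fox", "quick brown fox")` is one edit over four words.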

Possible improvements:

You could introduce tokens for punctuation (or replace them all with a simple token like '.').
The Levenshtein distance algorithm can be modified so that mistyping a character as one that is close to it on the keyboard generates a smaller distance. You could potentially apply the same idea one level up: when comparing individual words, use the character-level Levenshtein distance (normalized so that its value ranges from 0 to 1, for example by dividing it by the length of the longer of the two words), and then use that value as the substitution cost in the "outer" word-level distance calculation.
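
A rough sketch of the nested idea, assuming both texts are already arrays of word tokens. The names `char_distance`, `word_cost` and `fuzzy_word_distance` are invented for illustration, and keyboard-proximity weighting is not implemented here; every character edit still costs 1.

```ruby
# Character-level Levenshtein distance between two strings.
def char_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index do |x, i|
    curr = [i + 1]
    b.each_char.with_index do |y, j|
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + (x == y ? 0 : 1)].min
    end
    prev = curr
  end
  prev.last
end

# Cost of substituting one word for another, scaled to 0.0..1.0 by
# dividing by the length of the longer word.
def word_cost(a, b)
  return 0.0 if a == b
  char_distance(a, b).to_f / [a.length, b.length].max
end

# "Outer" word-level distance: inserting or deleting a word costs 1,
# substituting a word costs word_cost instead of a flat 1.
def fuzzy_word_distance(ref, hyp)
  prev = (0..hyp.length).map(&:to_f)
  ref.each_with_index do |x, i|
    curr = [i + 1.0]
    hyp.each_with_index do |y, j|
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + word_cost(x, y)].min
    end
    prev = curr
  end
  prev.last
end
```

With this scoring, a near miss such as "coarse" for "course" adds only 1/6 to the total instead of a full point, while a completely different word still costs close to 1.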

It's hard to say what algorithm will work best with your data. My tip is: make sure you have some automated way of visualizing or testing your solution. This way you can quickly iterate and experiment with your solution and see how your changes affect the end result.
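
One very simple way to get that kind of feedback loop is to print every transcription next to its score, worst first, and eyeball whether the ordering matches your intuition. The sketch below is only an illustration: `rank_transcriptions` and the sample sentences are made up, and the distance is a plain word-level Levenshtein that you would swap for whatever variant you are experimenting with.

```ruby
# Lowercase word tokens; punctuation is dropped.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Plain word-level Levenshtein distance between two token arrays.
def word_distance(a, b)
  prev = (0..b.length).to_a
  a.each_with_index do |x, i|
    curr = [i + 1]
    b.each_with_index do |y, j|
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + (x == y ? 0 : 1)].min
    end
    prev = curr
  end
  prev.last
end

# Returns [distance, text] pairs, sorted with the worst transcription first.
def rank_transcriptions(original, transcriptions)
  ref = tokenize(original)
  transcriptions
    .map { |t| [word_distance(ref, tokenize(t)), t] }
    .sort_by { |dist, _| -dist }
end

# Hypothetical sample data, just to show the report format.
original = "I would have taken that course"
samples = [
  "I would have taken that course",
  "I would have taken the course",
  "I took that course"
]

rank_transcriptions(original, samples).each do |dist, text|
  puts format("%3d  %s", dist, text)
end
```

Re-running a report like this after every change to the tokenizer or the scoring makes it easy to see which transcriptions moved up or down.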

EDIT:
In response to your concerns:

The easiest way would be to start with normalizing the shorter forms (using gsub):

str.gsub("n't", ' not').gsub("'d", " had").gsub("'re", " are")

Note that you can even expand "'s" to " is", even where that is not grammatically correct. If "John's" means "John is", you get it right. If it means "owned by John", then most likely both texts contain the same form, so expanding both "incorrectly" does not increase the distance. The remaining case is when it should mean "John has", but then "'s" will probably be followed by "got", so you can handle that easily as well.
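
Putting the gsub chain and the "'s" rules together might look something like the sketch below. The `CONTRACTIONS` table and the `expand_contractions` name are my own, the list is far from complete, and it only handles straight apostrophes. It also adds "'ve", which the one-liner above leaves out but which the "would've" example needs.

```ruby
# Ordered [pattern, replacement] pairs. Irregular forms come first so the
# generic "n't" rule does not turn "won't" into "wo not".
CONTRACTIONS = [
  [/\bwon't\b/, "will not"],
  [/\bcan't\b/, "can not"],
  [/n't\b/, " not"],
  [/'ve\b/, " have"],
  [/'re\b/, " are"],
  [/'ll\b/, " will"],
  [/'m\b/, " am"],
  [/'d\b/, " had"],          # as in the snippet above; "'d" can also mean "would"
  [/'s(?= got\b)/, " has"],  # "John's got" -> "John has got"
  [/'s\b/, " is"]            # everything else, including possessives
].freeze

# Lowercases the text and applies every rule in order.
def expand_contractions(text)
  CONTRACTIONS.reduce(text.downcase) do |s, (pattern, replacement)|
    s.gsub(pattern, replacement)
  end
end
```

Apply it to both the original and every transcription before tokenizing, so that "I would've ... if I'd known" and "I would have ... if I had known" end up as the same token sequence.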

You will probably also want to deal with numeric values (1st = first, etc.). Generally, you can probably improve the result by doing some preprocessing. Don't worry if it's not always 100% correct; it just has to be correct enough :)
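
For the numeric case, a small lookup table applied to the tokens is probably enough to start with. The table below is deliberately tiny and purely illustrative (`NUMBER_WORDS` and `spell_out_numbers` are made-up names); extend it with whatever numbers actually occur in your script.

```ruby
# Maps digit-style tokens to their spelled-out form.
NUMBER_WORDS = {
  "1" => "one", "2" => "two", "3" => "three", "4" => "four", "5" => "five",
  "1st" => "first", "2nd" => "second", "3rd" => "third",
  "4th" => "fourth", "5th" => "fifth"
}.freeze

# Replaces known numeric tokens and leaves everything else untouched.
def spell_out_numbers(tokens)
  tokens.map { |t| NUMBER_WORDS.fetch(t, t) }
end
```

Run it on the token arrays of both the original and the transcriptions, so "1st" and "first" compare as equal.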
