Pseudocode for a script to check transcription accuracy / edit distance
I need to write a script, probably in Ruby, that will take one block of text and compare a number of transcriptions of recordings of that text to the original to check for accuracy. If that's just completely confusing, I'll try explaining another way...
I have recordings of several different people reading a script that is a few sentences long. These recordings have all been transcribed back to text a number of times by other people. I need to take all of the transcriptions (hundreds) and compare them against the original script for accuracy.
I'm having trouble even conceptualising the pseudocode, and wondering if someone can point me in the right direction. Is there an established algorithm I should be considering? The Levenshtein distance has been suggested to me, but this seems like it wouldn't cope well with longer strings, considering differences in punctuation choices, whitespace, etc.--missing the first word would wreck the entire algorithm, even if every other word were perfect. I'm open to anything--thank you!
Edit:
Thanks for the tips, psyho. One of my biggest concerns, however, is a situation like this:
Original Text:
I would've taken that course if I'd known it was available!
Transcription:
I would have taken that course if I'd known it was available!
Even with a word-wise comparison of tokens, this transcription will be marked as quite errant, even though it's almost perfect, and this is hardly an edge-case! "would've" and "would have" are commonly pronounced extremely similarly, especially in this part of the world. Is there a way to make the approach you suggest robust enough to deal with this? I've thought about running a word-wise comparison both forward and backward and building a sort of composite score, but this would fall apart with a transcription like this:
I would have taken that course if I had known it was available!
Any ideas?
Comments (3)
Simple version:
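As a rough illustration only (an assumption about what a simple version could look like, not necessarily what was meant here), one option is to split both texts into words and compute the Levenshtein distance over the word arrays rather than over characters. The names `levenshtein`, `tokenize`, and `word_distance` are made up for this sketch.

```ruby
# Levenshtein distance between two arrays (here: arrays of words).
# Keeps only two rows of the dynamic-programming table at a time.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_with_index do |item_a, i|
    current = [i + 1]
    b.each_with_index do |item_b, j|
      cost = item_a == item_b ? 0 : 1
      current << [
        previous[j + 1] + 1,   # delete item_a
        current[j] + 1,        # insert item_b
        previous[j] + cost     # keep or substitute
      ].min
    end
    previous = current
  end
  previous.last
end

# Lowercase the text and keep runs of letters, digits, and apostrophes,
# so punctuation and whitespace differences are ignored.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Number of word-level edits needed to turn the original into the transcription.
def word_distance(original, transcription)
  levenshtein(tokenize(original), tokenize(transcription))
end

original = "I would've taken that course if I'd known it was available!"
puts word_distance(original, "I would have taken that course if I'd known it was available!")
puts word_distance(original, "would've taken that course if I'd known it was available!")
```

A missing first word costs a single edit here, because the remaining words still line up with each other.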
Possible improvements:
It's hard to say what algorithm will work best with your data. My tip is: make sure you have some automated way of visualizing or testing your solution. This way you can quickly iterate and experiment with your solution and see how your changes affect the end result.
EDIT:
In response to your concerns:
The easiest way would be to start with normalizing the shorter forms (using gsub):
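A minimal sketch of that kind of normalization, assuming a small hand-written expansion table. The `CONTRACTIONS` table and the `normalize` helper are hypothetical and would need extending for real data. Note that `'d` is ambiguous ("had" or "would"); this sketch always picks "had".

```ruby
# Hypothetical expansion table: patterns are applied in order,
# so the irregular forms come before the generic "n't" rule.
CONTRACTIONS = {
  /\bwon't\b/ => "will not",
  /\bcan't\b/ => "cannot",
  /n't\b/     => " not",
  /'ve\b/     => " have",
  /'ll\b/     => " will",
  /'re\b/     => " are",
  /'m\b/      => " am",
  /'d\b/      => " had" # assumption: could equally mean "would"
}.freeze

# Lowercase, expand contractions with gsub, then drop punctuation
# and collapse repeated spaces. Run this on BOTH texts before comparing.
def normalize(text)
  result = text.downcase.gsub("’", "'") # treat curly apostrophes like straight ones
  CONTRACTIONS.each do |pattern, expansion|
    result = result.gsub(pattern, expansion)
  end
  result.gsub(/[^a-z0-9' ]/, " ").squeeze(" ").strip
end

puts normalize("I would've taken that course if I'd known it was available!")
puts normalize("I would have taken that course if I had known it was available!")
```

After normalization the two sentences from the question come out identical, so any distance measure applied afterwards sees no difference between them.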
Note that you can even expand "'s" to " is", even where that is not grammatically correct. If "John's" means "John is", you get it right. If it means "owned by John", then most likely both texts contain the same form, so expanding both "incorrectly" does not increase the distance. The remaining case is when it means "John has", but then "'s" will probably be followed by "got", so you can handle that easily as well.
You will probably also want to deal with numeric values (1st = first, etc.). In general you can probably improve the result with some preprocessing. Don't worry if it's not always 100% correct; it only needs to be correct enough :)
Since you're ultimately trying to compare how different transcribers have dealt with the way the passage sounds, you might try comparing using a phonetic algorithm such as Metaphone.
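As a sketch of the idea, here is a comparison by phonetic key. Metaphone itself is not in Ruby's standard library, so this example swaps in a simplified American Soundex, a different and cruder phonetic algorithm, purely because it fits in a few lines. The `soundex` and `phonetic_keys` helpers are written for this illustration. Phonetic keys mostly help with homophones and misspellings ("course" vs "coarse"); a contraction such as "would've" would still need expanding first.

```ruby
# Simplified American Soundex, used here as a stand-in for Metaphone.
SOUNDEX_CODES = {
  "b" => "1", "f" => "1", "p" => "1", "v" => "1",
  "c" => "2", "g" => "2", "j" => "2", "k" => "2",
  "q" => "2", "s" => "2", "x" => "2", "z" => "2",
  "d" => "3", "t" => "3",
  "l" => "4",
  "m" => "5", "n" => "5",
  "r" => "6"
}.freeze

# Returns a four-character key: first letter plus up to three digits.
def soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, "").chars
  return "" if letters.empty?

  code = letters.first.upcase
  previous = SOUNDEX_CODES[letters.first]
  letters.drop(1).each do |ch|
    digit = SOUNDEX_CODES[ch]
    if digit
      code << digit unless digit == previous # skip adjacent repeats
      previous = digit
    elsif ch != "h" && ch != "w"
      previous = nil # vowels (and y) separate repeated codes; h and w do not
    end
  end
  code[0, 4].ljust(4, "0")
end

# Phonetic key for every word in a text.
def phonetic_keys(text)
  text.scan(/[[:alpha:]']+/).map { |word| soundex(word) }
end

original      = "I took that course"
transcription = "I took that coarse"
puts phonetic_keys(original) == phonetic_keys(transcription)
```

The resulting arrays of keys can then be compared with whatever distance measure you settle on, instead of comparing the raw words.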
After experimenting with the issues I noted in this question, I found that the Levenshtein Distance actually takes these problems into account. I don't fully understand how or why, but can see after experimentation that this is the case.
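A likely reason, sketched below with a plain character-level implementation (the `edit_distance` helper is written for this example, not taken from a library): Levenshtein distance counts insertions and deletions as single edits, so a dropped word does not shift everything after it out of alignment. It only costs as many edits as the word has characters.

```ruby
# Character-level Levenshtein distance, two rows of the table at a time.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index do |char_a, i|
    current = [i + 1]
    b.each_char.with_index do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j + 1] + 1,   # delete char_a
        current[j] + 1,        # insert char_b
        previous[j] + cost     # keep or substitute
      ].min
    end
    previous = current
  end
  previous.last
end

original           = "I would've taken that course if I'd known it was available!"
missing_first_word = "would've taken that course if I'd known it was available!"
expanded           = "I would have taken that course if I had known it was available!"

puts edit_distance(original, missing_first_word) # 2: only "I " is deleted
puts edit_distance(original, expanded)           # 6: three edits per contraction

# Dividing by the original's length gives a rough error rate per transcription.
puts edit_distance(original, expanded).to_f / original.length
```

Dividing by the length of the original is one simple way to make scores comparable across scripts of different lengths.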