在拼写错误的单词处放置点

发布于 2024-08-16 07:41:19 字数 746 浏览 10 评论 0 原文

我正在用 PHP 创建一个网络应用程序,人们可以尝试翻译他们在学校需要学习的单词。

例如,有人需要将荷兰语单词“weer”翻译为英语的“weather”,但不幸的是他输入了“whether”。因为他几乎打错了单词,所以我想再给他一次尝试,在他犯错的地方用点“.”:

Language A:   weer
Language B:   weather
Input:        whether
Output:       w..ther

或者,例如

Language A:   fout
Language B:   mistake
Input:        mitake
Output:       mi.take

或者:

Language A:   echt
Language B:   genuine
Input:        genuinely
Output:       genuinely (almost good, shorten the word a little bit)

但是,如果输入也不同与所需的翻译相差很大,我不想得到像 ........ 这样的输出,

我听说过 Levenshtein 距离,我认为我需要一种与该算法非常相似的算法,但是我不知道如何将点放在正确的位置,而不是回显要完成多少操作。

那么,如何在有人写错的地方返回带有点的拼错单词呢?

I'm creating a web app in PHP where people can try to translate words they need to learn for school.

For example, someone needs to translate the Dutch word 'weer' to 'weather' in English, but unfortunately he types 'whether'. Because he almost typed the right word, I want to give him another try, with dots '.' on the places where he made a mistake:

Language A:   weer
Language B:   weather
Input:        whether
Output:       w..ther

Or, for example

Language A:   fout
Language B:   mistake
Input:        mitake
Output:       mi.take

Or:

Language A:   echt
Language B:   genuine
Input:        genuinely
Output:       genuinely (almost good, shorten the word a little bit)

But, if the input differs too much from the desired translation, I don't want to get output like ........

I heard about Levenshtein distance and I think I need an algorithm that is much like that one, but I don't know how to place dots on the right place instead of echoing how many operations are to be done.

So, how can I return the misspelled word with dots on the places where someone made a mistake?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

噩梦成真你也成魔 2024-08-23 07:41:19

首先,看看维基百科上的 Levenshtein 算法
然后继续查看文章页面上的示例和结果矩阵d

                *k*     *i*     *t*     *t*     *e*     *n*
        >0       1       2       3       4       5       6 
*s*      1      >1       2       3       4       5       6 
*i*      2       2      >1       2       3       4       5 
*t*      3       3       2      >1       2       3       4 
*t*      4       4       3       2      >1       2       3 
*i*      5       5       4       3       2      >2       3 
*n*      6       6       5       4       3       3      >2 
*g*      7       7       6       5       4       4      >3 

距离位于矩阵的右下角,d[m, n]。但从那里开始
现在可以回溯到矩阵 d[1, 1] 左上角的最小步骤。您只需向左、向左或向上走每一步,以最小化路径为准。
在上面的示例中,您会找到标有“>”的路径标志:

  s i t t i n g      k i t t e n
0 1 1 1 1 2 2 3    0 1 1 1 1 2 2 3
  ^       ^   ^      ^       ^   ^
  changes in the distance, replace by dots

现在您可以在最小路径上找到距离发生变化的位置 d[i,j](在上面的示例中用 ^ 标记),并且对于那些您在第一个(或第二个)单词中放置点的字母i(或 j 分别)。

结果:

  s i t t i n g      k i t t e n
  ^       ^   ^      ^       ^   ^
  . i t t . n .      . i t t . n . 

First, take a look at the Levenshtein algorithm on wikipedia.
Then go on and look at the examples and the resulting matrix d on the article page:

                *k*     *i*     *t*     *t*     *e*     *n*
        >0       1       2       3       4       5       6 
*s*      1      >1       2       3       4       5       6 
*i*      2       2      >1       2       3       4       5 
*t*      3       3       2      >1       2       3       4 
*t*      4       4       3       2      >1       2       3 
*i*      5       5       4       3       2      >2       3 
*n*      6       6       5       4       3       3      >2 
*g*      7       7       6       5       4       4      >3 

The distance is found on lower right corner of the matrix, d[m, n]. But from there it
is now possible to follow backtrack the minimum steps to the upper left of of the matrix, d[1, 1]. You just go left, up-left, or up at each step whichever minimizes the path.
In the example above, you'd find the path marked by the '>' signs:

  s i t t i n g      k i t t e n
0 1 1 1 1 2 2 3    0 1 1 1 1 2 2 3
  ^       ^   ^      ^       ^   ^
  changes in the distance, replace by dots

Now you can find on the minimum path at which locations d[i,j] the distance changes (marked by ^ in the example above), and for those letters you put in the first (or second) word a dot at position i (or j respectively).

Result:

  s i t t i n g      k i t t e n
  ^       ^   ^      ^       ^   ^
  . i t t . n .      . i t t . n . 
瀟灑尐姊 2024-08-23 07:41:19

您正在寻找的术语称为“编辑距离”。使用诸如 Levenshtein distance 之类的东西会告诉您将一个字符串转换为另一个字符串所需的操作数(插入、删除、替换等)。

以下是其他“编辑距离”算法的列表

一旦您确定某个单词“足够接近”(即,它没有超过所需编辑的某个阈值),您就可以显示需要进行编辑的位置(通过显示点)。

那么你怎么知道把点放在哪里呢?

关于“Levenshtein 距离”的有趣之处在于,它使用 M x N 矩阵,每个轴上有一个单词(请参阅 Levenshtein 文章)。创建矩阵后,您可以确定哪些字母需要“额外编辑”才能正确。这就是你放置点的地方。如果信件需要“无需额外编辑”,您只需打印信件即可。很酷。

The terminology you are looking for is called "edit distance." Using something like Levenshtein distance will tell you the number of operations needed to transform one string into the other (insertions, deletion, substitutions, etc).

Here is a list of other "editing distance" algorithms.

Once you decide that a word is "close enough" (i.e. it doesn't exceed some threshold of edits needed), you can show where the edits need to occur (by showing dots).

So how do you know where to put the dots?

The interesting thing about the "Levenshtein distance" is it uses a M x N matrix with one word on each axis (see the sample matrix in Levenshtein article). Once you create the matrix, you can determine which letters require "additional edits" to be correct. That's where you put the dots. If the letter requires "no additional edits," you simply print the letter. Pretty cool.

指尖微凉心微凉 2024-08-23 07:41:19

我认为你需要做一个多步骤的过程,而不仅仅是 Levenshtein。首先,我会检查输入单词是否是目标单词的一种形式。这将抓住你的第三个例子,而且不用担心添加点。您还可以使用此步骤来捕获同义词。下一步是检查两个字符串的长度差异。

如果差异为 0,您可以进行字母比较以放置点。如果您不想显示所有点,那么您应该保留放置的点的数量,一旦超过限制,就会显示一些错误消息。 (抱歉,这是不正确的)

如果差异显示输入较长,您需要检查要删除的字母才能解决问题。在这里,您可以使用 Levenshtein 来查看它们的距离,如果距离太远,则显示错误消息,如果在范围内,您将需要反向执行 Levenshtein 的步骤,并以某种方式标记更改。不确定您想如何表明一封信需要删除。

如果差异显示输入较短,您可以使用编辑距离来查看两个单词是否足够接近或显示错误。然后再次执行相反的步骤,添加用于插入的点和用于替换的点。

实际上,最后两个步骤可以合并为一个函数,该函数运行该算法并记住插入、删除或替换并相应地更改输出。

I think you need to do a multi step process not just Levenshtein. First I would check if the input word is a form of target word. That would catch your 3rd example and also no worry about adding dots. You could also use this step to catch synonyms. The next step is to check the length difference of the two strings.

If the difference is 0 you can do a letter for letter comparison to place the dots. If you don't want to show all dots then you should keep a count of dots placed and once over the limit display some error message. (Sorry that was incorrect)

If the difference is shows the input is longer you need to check for a letter to be delete would fix the problem. Here you can use Levenshtein to see how close they are if they are too far away show your error message if it is in range you will need to do the steps of Levenshtein in reverse and mark the changes somehow. Not sure how you want to show that a letter needs deleted.

If the difference shows the input is shorter you can do use Levenshtein distance to see if the two words are close enough or show the error. Then do the steps in reverse again adding dots for insertions and dots for substitutions.

Actually the last two steps can be combined into one function that runs through the algorithm and remembers the insert delete or substitution and changes the output accordingly.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文