编辑距离:如何更好地处理单词交换位置?

发布于 2024-07-19 04:21:48 字数 847 浏览 7 评论 0原文

我已经使用 PHP levenshtein 函数成功比较了字符串。

但是,对于包含交换位置的子字符串的两个字符串,该算法将它们计为全新的子字符串。

例如:

levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences

被视为具有更少的共同点

levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences

我更喜欢一种算法,它可以看出前两个更加相似。

我怎样才能提出一个比较函数来识别已切换位置的子字符串与编辑不同?

我想到的一种可能的方法是在比较之前将字符串中的所有单词按字母顺序排列。 这使得单词的原始顺序完全脱离了比较。 然而,这样做的缺点是,仅更改单词的第一个字母可能会比更改单个字母造成更大的破坏。

我想要实现的是比较关于人的两个事实(作为自由文本字符串),并确定这些事实表明同一事实的可能性有多大。 例如,事实可能是某人就读的学校、雇主或出版商的名称。 两条记录中同一所学校的拼写可能不同、单词顺序不同、多余的单词等,因此如果我们要很好地猜测它们指的是同一所学校,那么匹配就必须有些模糊。 到目前为止,它对于拼写错误的处理效果非常好(除此之外,我使用了类似于变音素的语音算法),但如果你切换学校中常见的单词顺序,效果就很差:“xxx College”与“xxx学院”。

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings.

For example:

levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences

are treated as having less in common than:

levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences

I'd prefer an algorithm which saw that the first two were more similar.

How could I go about coming up with a comparison function that can identify substrings which have switched position as being distinct to edits?

One possible approach I've thought of is to put all the words in the string into alphabetical order, before the comparison. That takes the original order of the words completely out of the comparison. A downside to this, however, is that changing just the first letter of a word can create a much bigger disruption than a changing a single letter should cause.

What I'm trying to achieve is to compare two facts about people which are free text strings, and decide how likely these facts are to indicate the same fact. The facts might be the school someone attended, the name of their employer or publisher, for example. Two records may have the same school spelled differently, words in a different order, extra words, etc, so the matching has to be somewhat fuzzy if we are to make a good guess that they refer to the same school. So-far it is working very well for spelling errors (I am using a phoenetic algorithm similar to metaphone on top of this all) but very poorly if you switch the order of words around which seem common in a school: "xxx college" vs "college of xxx".

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

天涯离梦残月幽梦 2024-07-26 04:21:48

N-grams

使用N-grams,它支持多个-整个文本中的字符换位

总体思路是,将有问题的两个字符串拆分为所有可能的 2-3 个字符子字符串 (n-gram),并将两个字符串之间共享的 n-gram 数量视为它们的相似性度量。 然后可以通过将共享数除以较长字符串中的 n 元语法总数来对其进行标准化。 这计算起来很简单,但相当强大。

对于例句:

A. The quick brown fox
B. brown quick The fox
C. The quiet swine flu

A 和 B 共享 18 2-grams

A 和 C 仅共享 8 2-grams

总共20个可能。

Gravano 等人对此进行了更详细的讨论。 纸

tf-idf 和余弦相似度

一个不那么简单的替代方案,但基于信息论,将使用术语 词频-逆文档频率 (tf-idf) 来权衡标记,构建句子向量,然后使用 余弦相似度作为相似度度量。

该算法是:

  1. 计算每个句子的 2 字符标记频率 (tf)。
  2. 计算逆句子频率 (idf),它是语料库中所有句子数量(在本例中为 3)除以特定标记在所有句子中出现的次数的商的对数。 在这种情况下,th 出现在所有句子中,因此它的信息内容为零 (log(3/3)=0)。
    idf Formula
  3. 通过将 tf 和 idf 表中的相应单元格相乘来生成 tf-idf 矩阵。
    tfidf
  4. 最后,计算所有句子对的余弦相似度矩阵,其中 A 和 B 是 tf-idf 表中的权重相应的代币。 范围从 0(不相似)到 1(相等)。
    余弦相似度
    相似矩阵

Levenshtein 修改和 Metaphone

关于其他答案。 Damerau–Levenshtein 修改仅支持两个相邻<的转置/strong> 字符。 Metaphone 旨在匹配听起来相同或不同的单词
用于相似性匹配。

N-grams

Use N-grams, which support multiple-character transpositions across the whole text.

The general idea is that you split the two strings in question into all the possible 2-3 character substrings (n-grams) and treat the number of shared n-grams between the two strings as their similarity metric. This can be then normalized by dividing the shared number by the total number of n-grams in the longer string. This is trivial to calculate, but fairly powerful.

For the example sentences:

A. The quick brown fox
B. brown quick The fox
C. The quiet swine flu

A and B share 18 2-grams

A and C share only 8 2-grams

out of 20 total possible.

This has been discussed in more detail in the Gravano et al. paper.

tf-idf and cosine similarity

A not so trivial alternative, but grounded in information theory would be to use term term frequency–inverse document frequency (tf-idf) to weigh the tokens, construct sentence vectors and then use cosine similarity as the similarity metric.

The algorithm is:

  1. Calculate 2-character token frequencies (tf) per sentence.
  2. Calculate inverse sentence frequencies (idf), which is a logarithm of a quotient of the number of all sentences in the corpus (in this case 3) divided by the number of times a particular token appears across all sentences. In this case th is in all sentences so it has zero information content (log(3/3)=0).
    idf formula
  3. Produce the tf-idf matrix by multiplying corresponding cells in the tf and idf tables.
    tfidf
  4. Finally, calculate cosine similarity matrix for all sentence pairs, where A and B are weights from the tf-idf table for the corresponding tokens. The range is from 0 (not similar) to 1 (equal).
    cosine similarity
    similarity matrix

Levenshtein modifications and Metaphone

Regarding other answers. Damerau–Levenshtein modificication supports only the transposition of two adjacent characters. Metaphone was designed to match words that sound the same and not
for similarity matching.

回忆躺在深渊里 2024-07-26 04:21:48

这很容易。 只需在单词而不是字母上使用 Damerau-Levenshtein 距离即可。

Its easy. Just use the Damerau-Levenshtein distance on the words instead of letters.

蓝梦月影 2024-07-26 04:21:48

在空格上爆炸,对数组进行排序,内爆,然后进行编辑。

Explode on spaces, sort the array, implode, then do the Levenshtein.

装纯掩盖桑 2024-07-26 04:21:48

你也可以试试这个。 (只是一个额外的建议)

$one = metaphone("The quick brown fox"); // 0KKBRNFKS
$two = metaphone("brown quick The fox"); // BRNKK0FKS
$three = metaphone("The quiet swine flu"); // 0KTSWNFL

similar_text($one, $two, $percent1); // 66.666666666667
similar_text($one, $three, $percent2); // 47.058823529412
similar_text($two, $three, $percent3); // 23.529411764706

这将表明第一和第二比一和三以及二和三更相似。

You can also try this. (just an extra suggestion)

$one = metaphone("The quick brown fox"); // 0KKBRNFKS
$two = metaphone("brown quick The fox"); // BRNKK0FKS
$three = metaphone("The quiet swine flu"); // 0KTSWNFL

similar_text($one, $two, $percent1); // 66.666666666667
similar_text($one, $three, $percent2); // 47.058823529412
similar_text($two, $three, $percent3); // 23.529411764706

This will show that the 1st and 2nd are more similar than one and three and two and three.

温柔女人霸气范 2024-07-26 04:21:48

我一直在拼写检查器中实现编辑。

您要求的是将换位计为 1 次编辑。

如果您只想计算一个单词的换位次数,这很容易。 然而,对于 2 个或更多距离的单词换位,算法的添加是最坏的情况 !(max(wordorder1.length(), wordorder2.length()))。 将非线性子算法添加到已经是二次算法中并不是一个好主意。

这就是它的工作原理。

if (wordorder1[n] == wordorder2[n-1])
{
  min(workarray[x-1, y] + 1, workarray[x, y-1] + 1, workarray[x-2, y-2]);
}
  else
{
  min(workarray[x-1, y] + 1, workarray[x, y-1] + 1);
}

只是为了触摸换位。 如果您想要所有换位,则必须从该点开始向后比较每个位置,

1[n] == 2[n-2].... 1[n] == 2[0]....

因此您会明白为什么他们不将其包含在标准方法中。

I've been implementing levenshtein in a spell checker.

What you're asking for is counting transpositions as 1 edit.

This is easy if you only wish to count transpositions of one word away. However for transposition of words 2 or more away, the addition to the algorithm is worst case scenario !(max(wordorder1.length(), wordorder2.length())). Adding a non-linear subalgorithm to an already quadratic algorithm is not a good idea.

This is how it would work.

if (wordorder1[n] == wordorder2[n-1])
{
  min(workarray[x-1, y] + 1, workarray[x, y-1] + 1, workarray[x-2, y-2]);
}
  else
{
  min(workarray[x-1, y] + 1, workarray[x, y-1] + 1);
}

JUST for touching transpositions. If you want all transpositions, you'd have to for every position work backwards from that point comparing

1[n] == 2[n-2].... 1[n] == 2[0]....

So you see why they don't include this in the standard method.

胡大本事 2024-07-26 04:21:48

接受这个答案< /a> 并进行以下更改:

void match(trie t, char* w, string s, int budget){
  if (budget < 0) return;
  if (*w=='\0') print s;
  foreach (char c, subtrie t1 in t){
    /* try matching or replacing c */
    match(t1, w+1, s+c, (*w==c ? budget : budget-1));
    /* try deleting c */
    match(t1, w, s, budget-1);
  }
  /* try inserting *w */
  match(t, w+1, s + *w, budget-1);
  /* TRY SWAPPING FIRST TWO CHARACTERS */
  if (w[1]){
    swap(w[0], w[1]);
    match(t, w, s, budget-1);
    swap(w[0], w[1]);
  }
}

这是用于在 trie 中进行字典搜索,但对于匹配单个单词,这是相同的想法。 你正在做分支定界,在任何时候,你都可以做出任何你喜欢的改变,只要你给它一个成本。

Take this answer and make the following change:

void match(trie t, char* w, string s, int budget){
  if (budget < 0) return;
  if (*w=='\0') print s;
  foreach (char c, subtrie t1 in t){
    /* try matching or replacing c */
    match(t1, w+1, s+c, (*w==c ? budget : budget-1));
    /* try deleting c */
    match(t1, w, s, budget-1);
  }
  /* try inserting *w */
  match(t, w+1, s + *w, budget-1);
  /* TRY SWAPPING FIRST TWO CHARACTERS */
  if (w[1]){
    swap(w[0], w[1]);
    match(t, w, s, budget-1);
    swap(w[0], w[1]);
  }
}

This is for dictionary search in a trie, but for matching to a single word, it's the same idea. You're doing branch-and-bound, and at any point, you can make any change you like, as long as you give it a cost.

梦归所梦 2024-07-26 04:21:48

消除两个字符串之间的重复单词,然后使用 Levenshtein。

Eliminate duplicate words between the two strings and then use Levenshtein.

冷弦 2024-07-26 04:21:48

我相信这是使用向量空间搜索引擎的一个很好的例子。

在这种技术中,每个文档本质上成为一个向量,其维度与整个语料库中不同单词的维度一样多; 然后类似的文档占据该向量空间中的相邻区域。 该模型的一个很好的特性是查询也只是文档:要回答查询,您只需计算它们在向量空间中的位置,并且您的结果是您可以找到的最接近的文档。 我确信有 PHP 的即用型解决方案。

为了模糊向量空间的结果,您可以考虑使用词干/类似的自然语言处理技术,并使用 levenshtein 为整体词汇中出现的相似单词构建二次查询。

i believe this is a prime example for using a vector-space search engine.

in this technique, each document essentially becomes a vector with as many dimensions as there are different words in the entire corpus; similar documents then occupy neighboring areas in that vector space. one nice property of this model is that queries are also just documents: to answer a query, you simply calculate their position in vector space, and your results are the closest documents you can find. i am sure there are get-and-go solutions for PHP out there.

to fuzzify results from vector space, you could consider to do stemming / similar natural language processing technique, and use levenshtein to construct secondary queries for similar words that occur in your overall vocabulary.

空城之時有危險 2024-07-26 04:21:48

如果第一个字符串是 A,第二个字符串是 B:

  1. 将 A 和 B 拆分为单词
  2. 对于 A 中的每个单词,在 B 中找到最佳匹配的单词(使用 levenshtein)
  3. 从 B 中删除该单词并将其同时放入 B* 中索引作为 A 中的匹配单词。
  4. 现在比较 A 和 B*

示例:

A: The quick brown fox
B: Quick blue fox the
B*: the Quick blue fox

您可以通过多次执行来改进步骤 2,首先仅查找精确匹配,然后查找 A 中没有同伴的单词的紧密匹配在 B* 中,然后是不太接近的比赛,等等。

If the first string is A and the second one is B:

  1. Split A and B into words
  2. For every word in A, find the best matching word in B (using levenshtein)
  3. Remove that word from B and put it in B* at the same index as the matching word in A.
  4. Now compare A and B*

Example:

A: The quick brown fox
B: Quick blue fox the
B*: the Quick blue fox

You could improve step 2 by doing it in multiple passes, finding only exact matches at first, then finding close matches for words in A that don't have a companion in B* yet, then less close matches, etc.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文