N-gram splitting function for string similarity comparison
As part of an exercise to better understand F#, which I am currently learning, I wrote a function to split a given string into n-grams.

1) I would like to receive feedback about my function: can it be written more simply or more efficiently?

2) My overall goal is to write a function that returns string similarity (on a 0.0 .. 1.0 scale) based on n-gram similarity. Does this approach work well for short string comparisons, and can it reliably be used to compare large strings (articles, for example)?

3) I am aware of the fact that n-gram comparisons ignore the context of the two strings. What method would you suggest to accomplish my goal?
// s : string - target string to split into n-grams
// n : int    - n-gram size to split string into
let ngram_split (s: string, n: int) =
    let ngram_count = s.Length - (s.Length % n)
    let ngram_list =
        List.init ngram_count (fun i ->
            if i + n >= s.Length then
                // pad trailing n-grams with '#' up to length n
                s.Substring(i, s.Length - i)
                + String.init ((i + n) - s.Length) (fun _ -> "#")
            else
                s.Substring(i, n))
    let ngram_array_unique =
        ngram_list
        |> Seq.ofList
        |> Seq.distinct
        |> Array.ofSeq
    // produce tuples of (n-gram, number of occurrences in the original string)
    Seq.init ngram_array_unique.Length (fun i ->
        (ngram_array_unique.[i],
         ngram_list
         |> List.filter (fun item -> item = ngram_array_unique.[i])
         |> List.length))
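A quick example call (a sketch; the output shown is what tracing the code above suggests):

```fsharp
// assumes ngram_split from above is in scope
ngram_split ("hello", 2)
|> Seq.toList
|> printfn "%A"
// should print [("he", 1); ("el", 1); ("ll", 1); ("lo", 1)]
```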
I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help make your implementation simpler.

Many of the operations that you need are already available in F# library functions for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:

The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (the n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram itself is the key (so we get groups where each group contains identical n-grams). Then we just convert each n-gram to a string and count the number of elements in its group. Alternatively, you can write the entire function as a single processing pipeline like this:
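The code did not survive the page extraction; a reconstruction of the described pipeline might look like this (my sketch, with System.String used to turn each char[] window back into a string):

```fsharp
let ngramSplit n (s: string) =
    Seq.windowed n s                  // sliding windows of n characters, as char[]
    |> Seq.groupBy id                 // group identical windows (structural equality)
    |> Seq.map (fun (ngram, occurrences) ->
        (System.String(ngram), Seq.length occurrences))

// ngramSplit 2 "hello" should yield [("he", 1); ("el", 1); ("ll", 1); ("lo", 1)]
```

Note that, unlike the original ngram_split, this version does not pad trailing n-grams with '#'.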
Your code looks OK to me. Since n-gram extraction and similarity comparison are used very often, you should consider some efficiency issues here.

(1) The MapReduce pattern is very well suited to your frequency-counting problem: group the words and add all the counts together.
let wordCntReducer (wseq: seq<int * int>) =
    wseq
    |> Seq.groupBy (fun (id, cnt) -> id)
    |> Seq.map (fun (id, idseq) ->
        (id, idseq |> Seq.sumBy (fun (id, cnt) -> cnt)))
(* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2] *)
You also need to maintain a <word, int> map while building the n-grams for a set of strings, as it is much more efficient to handle integers rather than strings during later processing.

(2) To compare the distance between two short strings, a common practice is to use edit distance with simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use a TF-IDF feature representation. Actually, the code above is for term frequency counting, extracted from my data mining library.

(3) There are complex NLP methods, e.g. tree kernels based on the parse tree, to incorporate the context information.
I think you have some good answers for question (1).
Question (2):
You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger the better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:
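A possible F# rendering of that equation, assuming both inputs are Map<string, int> frequency maps (the function name and representation are my assumptions):

```fsharp
let cosineSimilarity (a: Map<string, int>) (b: Map<string, int>) =
    // dot product over the keys of a; keys missing from b contribute 0
    let dot =
        a |> Seq.sumBy (fun (KeyValue (k, v)) ->
            match Map.tryFind k b with
            | Some w -> float (v * w)
            | None   -> 0.0)
    // Euclidean norm of a frequency map
    let norm (m: Map<string, int>) =
        m |> Seq.sumBy (fun (KeyValue (_, v)) -> float (v * v)) |> sqrt
    dot / (norm a * norm b)
```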
For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:
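That key-trimming step might look like this (a sketch, assuming the same Map<string, int> representation):

```fsharp
// keep only the n-grams that occur in both maps
let intersectMaps (a: Map<string, int>) (b: Map<string, int>) =
    (a |> Map.filter (fun k _ -> b.ContainsKey k),
     b |> Map.filter (fun k _ -> a.ContainsKey k))
```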
With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging. Then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licencing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good the accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.
Question (3):
The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, eg for comparing DNA strings.
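For completeness, the classic dynamic-programming formulation of Levenshtein distance, as a sketch in F#:

```fsharp
let editDistance (a: string) (b: string) =
    let m, n = a.Length, b.Length
    let d = Array2D.create (m + 1) (n + 1) 0
    for i in 0 .. m do d.[i, 0] <- i          // delete all i characters
    for j in 0 .. n do d.[0, j] <- j          // insert all j characters
    for i in 1 .. m do
        for j in 1 .. n do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            d.[i, j] <-
                min (min (d.[i - 1, j] + 1)       // deletion
                         (d.[i, j - 1] + 1))      // insertion
                    (d.[i - 1, j - 1] + cost)     // substitution
    d.[m, n]

// editDistance "kitten" "sitting" should give 3
```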
The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Macromolecules". It's pretty old (1983), but gives good examples of how to adapt the basic algorithm to a number of applications.
Question 3:
My reference book is Computing Patterns in Strings by Bill Smyth