Well that is an interesting question. You could use NLTK to extract the core concepts (noun groups) and compare those directly. In this case you'd get a set of noun groups for each paragraph; call them Group 1 and Group 2.
Now, similarity is not bi-directional: Group 2 is fully represented in Group 1, but not the other way around. You could measure the fraction of one group's concepts that appear in the other in each direction, so G21 would be 1.0 and G12 would be 0.57, and then combine the two with the harmonic mean: H = 2AB/(A + B) = 2(1.0)(0.57)/(1.0 + 0.57) ≈ 0.72.
Now, the two paragraphs aren't identical, but in your example you wanted them to count as a match, and here their harmonic mean H is 0.72. The higher the number, the harder it is to achieve: H > 0.8 is considered good, and H > 0.9 is exceptional for most systems. So what you must decide is where to draw your arbitrary line in the sand. It has to be arbitrary because you haven't given a definition of the degree of similarity. Do you set it at 0.6? 0.7? How about 0.12948737? A good way of discovering this threshold is to take test examples, judge their similarity yourself without doing the math, then run the numbers and see what you come up with.
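If it helps, here is a minimal Python sketch of that idea. It assumes NLTK's default tokenizer/tagger and a simple regexp noun-phrase grammar (both are assumptions you would tune for your data); it extracts noun groups from each paragraph and combines the two directional overlaps with the harmonic mean described above:

```python
import nltk

# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

# Simple noun-phrase grammar: optional determiner, any adjectives, then nouns.
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def noun_groups(text):
    """Extract the set of noun groups (lower-cased) from a paragraph."""
    groups = set()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            groups.add(" ".join(word for word, tag in subtree.leaves()).lower())
    return groups

def harmonic_similarity(text_a, text_b):
    """Directional noun-group overlap in both directions, combined with the harmonic mean."""
    a, b = noun_groups(text_a), noun_groups(text_b)
    if not a or not b:
        return 0.0
    ab = len(a & b) / len(a)   # fraction of A's groups also found in B
    ba = len(a & b) / len(b)   # fraction of B's groups also found in A
    if ab + ba == 0:
        return 0.0
    return 2 * ab * ba / (ab + ba)
```

The chunk grammar shown is the common textbook default; a different chunker (or a full parser) will change which noun groups come out, and therefore the score.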
I don't know whether there is a .NET implementation, but you can easily code this yourself.
You can use a reversed n-gram index (A), look up the n-grams of your search paragraph (B), and divide the number of common n-grams by the total number of n-grams (C). That gives you a similarity measure for which you can set a threshold, and probably do other stuff as well.
(A) Create a reversed n-gram index: get all n-grams from the paragraphs you want to search through and store them in a db.
(B) When looking up a paragraph to match against the corpus, calculate all its n-grams and look up each one in the reversed index.
(C) Calculate a score for each entry you find by dividing the number of n-grams it has in common with the search paragraph by the size of the union of both paragraphs' n-grams.
Implementation details:
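A minimal in-memory Python sketch of how (A), (B) and (C) could fit together (a dict stands in for the database; word trigrams and the Jaccard-style score are assumptions you can swap out):

```python
from collections import defaultdict

def ngrams(text, n=3):
    """Word n-grams of a paragraph, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

class NgramIndex:
    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(set)   # (A) n-gram -> ids of paragraphs containing it
        self.paragraphs = {}            # id -> that paragraph's n-gram set

    def add(self, doc_id, text):
        grams = ngrams(text, self.n)
        self.paragraphs[doc_id] = grams
        for gram in grams:
            self.index[gram].add(doc_id)

    def query(self, text, threshold=0.5):
        grams = ngrams(text, self.n)
        if not grams:
            return []
        candidates = set()
        for gram in grams:              # (B) look up each n-gram in the reversed index
            candidates |= self.index.get(gram, set())
        results = []
        for doc_id in candidates:       # (C) common n-grams / union of n-grams
            other = self.paragraphs[doc_id]
            score = len(grams & other) / len(grams | other)
            if score >= threshold:
                results.append((doc_id, score))
        return sorted(results, key=lambda r: r[1], reverse=True)
```

Character n-grams instead of word n-grams, and a real key-value store instead of the dict, are straightforward substitutions.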
Happy coding
I suggest you use the same algorithm Google uses to find duplicate documents on the web:
http://www.cs.sunysb.edu/~cse692/papers/henzinger_sigir06.pdf
Hash the phrases using Rabin's fingerprinting algorithm, sort the hashes, and compare the bottom 10. Very fast.
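For illustration, a rough Python sketch of that scheme, with hashlib.md5 standing in for a Rabin fingerprint and word shingles of length 4 as an assumed phrase unit; only the ten smallest hashes per document are kept and compared:

```python
import hashlib

def bottom_hashes(text, shingle_len=4, keep=10):
    """Smallest `keep` hashes of the document's word shingles.

    hashlib.md5 is a stand-in for a Rabin fingerprint; any stable hash
    works for the sketch.
    """
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_len])
                for i in range(max(1, len(words) - shingle_len + 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return set(hashes[:keep])

def near_duplicate_score(text_a, text_b):
    """Overlap of the bottom-10 hash sets; 1.0 means identical sketches."""
    a, b = bottom_hashes(text_a), bottom_hashes(text_b)
    return len(a & b) / max(len(a), len(b))
```

Documents whose bottom-10 sets overlap heavily are likely near-duplicates; as with the other answers, the exact cutoff is something you would pick empirically.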
Patrick.