Finding similar paragraphs in different documents

3 Answers

遗弃M 2024-12-16 04:47:12

Well, that is an interesting question. You could use NLTK to extract the core concepts (noun groups) and compare those directly. In this case you'd get:

  1. Governing Law, Agreement, Laws, State, Missouri, conflict, law principles
  2. Law, laws, State, Missouri, Agreement

Now, similarity is not bi-directional. Group 2 is fully represented in Group 1, but not the other way around. You could apply a harmonic mean: count the fraction of each group's concepts that appear in the other group, so G21 (the coverage of Group 2 by Group 1) would be 1.0 and G12 would be 4/7 ≈ 0.57. The harmonic mean is then H = 2AB/(A+B) = 2(1.0)(0.57)/(1.0 + 0.57) ≈ 0.73.

Now, the paragraphs aren't identical, but in your example you wanted there to be a match between the two. In this case their harmonic mean H is about 0.73. The higher the number, the harder it is to achieve: H > 0.8 is considered good, and H > 0.9 is exceptional for most systems. So what you must decide is where to draw your arbitrary line in the sand. It has to be arbitrary, because you haven't given a definition of the degree of similarity. So do you set it at 0.6? 0.7? How about 0.12948737? A good way of discovering this threshold is to take test examples, judge their similarity yourself without doing the math, then run the numbers and see how they compare with your intuition.
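
A minimal sketch of this approach in Python, assuming NLTK is installed and its tokenizer/tagger models have been downloaded. The chunk grammar and the exact-match set intersection are simplifying assumptions; the comparison above matches concepts more loosely (e.g. "Law" against "Governing Law"):

    import nltk

    # One-time model downloads:
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    def core_concepts(text):
        """Extract noun groups with a simple regex chunk grammar."""
        chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")  # adjectives + nouns
        tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(text)))
        return {" ".join(word.lower() for word, _ in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")}

    def harmonic_similarity(p1, p2):
        """H = 2AB/(A+B), where A and B are the two coverage fractions."""
        g1, g2 = core_concepts(p1), core_concepts(p2)
        if not g1 or not g2:
            return 0.0
        common = g1 & g2                  # exact match: a simplification
        a, b = len(common) / len(g1), len(common) / len(g2)
        return 2 * a * b / (a + b) if a + b else 0.0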

吹泡泡o 2024-12-16 04:47:12

I don't know whether there is a .NET implementation, but you can easily code this yourself.

You can use a reversed n-gram index (A), look up the n-grams of your search paragraph (B), and divide the number of common n-grams by the total number of n-grams (C). That gives you a similarity score between 0 and 1, against which you can set a threshold (and probably do other things as well).

(A) Create a reversed n-gram index: get all n-grams from the paragraphs you want to search through and store them in a db.

Example:

A small corpus consisting of the following (short) paragraphs:
{rain, spain, main, plain}

has the following 3-grams:
{#ra, rai, ain, in#, #sp, spa, pai, #ma, mai, #pl, pla, lai}

You get the following key-value pairs:
{(#ra, rain), (rai, rain), (ain, rain), (ain, spain), (ain, main), (ain, plain), (in#, rain), (in#, spain), (in#, main), (in#, plain), (#sp, spain), (spa, spain), (pai, spain), (#ma, main), (mai, main), (#pl, plain), (pla, plain), (lai, plain)}
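
A sketch of step (A) — in Python rather than .NET, purely as an illustration — with an in-memory dict standing in for the db:

    from collections import defaultdict

    def ngrams(word, n=3):
        """Character n-grams of a word, padded with '#' at both ends."""
        padded = "#" + word + "#"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def build_index(corpus, n=3):
        """Reversed n-gram index: n-gram -> set of entries containing it."""
        index = defaultdict(set)
        for entry in corpus:
            for gram in ngrams(entry, n):
                index[gram].add(entry)
        return index

    index = build_index(["rain", "spain", "main", "plain"])
    print(sorted(index["ain"]))   # ['main', 'plain', 'rain', 'spain']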

(B) When looking up a paragraph to match against the corpus, calculate all its n-grams and look up each one in the reversed index.

Example:

Look up "pain" in the db we just created. We get the following n-grams:
{#pa, pai, ain, in#}

Here are the matching entries:

#pa -> no match

pai -> {spain}

ain -> {rain, spain, main, plain}

in# -> {rain, spain, main, plain}
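
Step (B) as code, reusing ngrams and index from the step (A) sketch:

    # Look up each 3-gram of "pain" in the reversed index.
    for gram in ngrams("pain"):
        matches = index.get(gram, set())
        print(gram, "->", sorted(matches) if matches else "no match")
    # #pa -> no match
    # pai -> ['spain']
    # ain -> ['main', 'plain', 'rain', 'spain']
    # in# -> ['main', 'plain', 'rain', 'spain']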

(C) Calculate a score for each entry you find by dividing the number of n-grams it shares with the search paragraph by the size of the union of both paragraphs' n-gram sets (i.e. the Jaccard index).

Example:

pain vs. spain: common n-grams {pai, ain, in#}; union of n-grams {#pa, pai, ain, in#, #sp, spa}; score: 3/6 (0.5)

pain vs. rain: common n-grams {ain, in#}; union of n-grams {#pa, pai, ain, in#, #ra, rai}; score: 2/6 (≈0.33)

pain vs. main: common n-grams {ain, in#}; union of n-grams {#pa, pai, ain, in#, #ma, mai}; score: 2/6 (≈0.33)

pain vs. plain: common n-grams {ain, in#}; union of n-grams {#pa, pai, ain, in#, #pl, pla, lai}; score: 2/7 (≈0.29)
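
Step (C) as code, again reusing the ngrams helper from the step (A) sketch; the score is the Jaccard index of the two n-gram sets, and the output matches the example above:

    def score(query, candidate, n=3):
        """Common n-grams divided by the union of both n-gram sets."""
        q, c = set(ngrams(query, n)), set(ngrams(candidate, n))
        return len(q & c) / len(q | c)

    for candidate in ["spain", "rain", "main", "plain"]:
        print(f"pain vs. {candidate}: {score('pain', candidate):.2f}")
    # pain vs. spain: 0.50
    # pain vs. rain: 0.33
    # pain vs. main: 0.33
    # pain vs. plain: 0.29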

Implementation details:

  • Make the n in n-grams configurable both in search and storage; this gives you the flexibility to experiment with 3-grams, 4-grams, 5-grams, ...
  • Use a fast (in-memory/in-process) db, such as SQLite or BerkeleyDb
  • Make sure duplicate keys are allowed (BerkeleyDb allows this, but you need a workaround for duplicate key-value pairs: for example, "maintain" contains the 3-gram ain twice, so technically you cannot store both of its (ain, maintain) pairs; see the sketch below)
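
A minimal storage sketch using Python's built-in sqlite3 module (my choice for illustration; the list above only names SQLite and BerkeleyDb as candidates). A plain table with no uniqueness constraint sidesteps the duplicate-pair restriction, and grouping at query time recovers the counts:

    import sqlite3

    con = sqlite3.connect(":memory:")   # in-process db, as suggested above
    con.execute("CREATE TABLE grams (gram TEXT, word TEXT)")  # duplicates OK
    con.execute("CREATE INDEX idx_gram ON grams (gram)")

    def ngrams(word, n=3):
        padded = "#" + word + "#"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    # "maintain" yields the 3-gram 'ain' twice; plain rows keep both copies.
    for w in ["rain", "spain", "main", "plain", "maintain"]:
        con.executemany("INSERT INTO grams VALUES (?, ?)",
                        [(g, w) for g in ngrams(w)])

    for row in con.execute("SELECT word, COUNT(*) FROM grams "
                           "WHERE gram = 'ain' GROUP BY word"):
        print(row)   # ('maintain', 2); every other word appears once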

Happy coding

血之狂魔 2024-12-16 04:47:12

I suggest you use the same algorithm Google uses to find duplicate documents on the web:
http://www.cs.sunysb.edu/~cse692/papers/henzinger_sigir06.pdf

Hash the phrases using Rabin's fingerprinting algorithm, sort these hashes, and compare the bottom 10. Very fast.
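
A rough Python sketch of the idea. hashlib.md5 stands in for Rabin fingerprints (which are faster, but not in the standard library), and the 4-word shingle length is an arbitrary choice; keeping only the bottom 10 hashes per document is the sketching trick described above:

    import hashlib

    def bottom_sketch(text, shingle_len=4, k=10):
        """Hash every shingle_len-word shingle; keep the k smallest hashes."""
        words = text.lower().split()
        shingles = {" ".join(words[i:i + shingle_len])
                    for i in range(max(1, len(words) - shingle_len + 1))}
        hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                        for s in shingles)
        return set(hashes[:k])

    def resemblance(doc_a, doc_b):
        """Sketch overlap approximates the documents' shingle-set similarity."""
        sa, sb = bottom_sketch(doc_a), bottom_sketch(doc_b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0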

Patrick.
