Finding similar paragraphs in different documents

3 Answers

遗弃M 2024-12-16 04:47:12

Well, that is an interesting question. You could use NLTK to extract the core concepts (noun groups) and compare those directly. In this case you'd get:

  1. Governing Law, Agreement, Laws, State, Missouri, conflict, law principles
  2. Law, laws, State, Missouri, Agreement

Now, similarity is not bi-directional. Group 2 is fully represented in Group 1, but not the other way around. You could apply a harmonic mean: count the fraction of each group's concepts that appear in the other group, so G21 (the coverage of Group 2 by Group 1) would be 1.0 and G12 would be 4/7 ≈ 0.57. The harmonic mean is then H = 2AB/(A+B) = 2(1.0)(0.57)/(1.0 + 0.57) ≈ 0.73.

Now, the paragraphs aren't identical, but in your example you wanted there to be a match between the two. In this case their harmonic mean H is about 0.73. The higher the number, the harder it is to achieve: H > 0.8 is considered good, and H > 0.9 is exceptional for most systems. So what you must decide is where to draw your arbitrary line in the sand. It has to be arbitrary, because you haven't given a definition of the degree of similarity. So do you set it at 0.6? 0.7? How about 0.12948737? A good way of discovering this threshold is to take test examples, judge their similarity yourself without doing the math, then run the numbers and see how they compare with your intuition.
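
A minimal sketch of this approach in Python, assuming NLTK is installed and its tokenizer/tagger models have been downloaded. The chunk grammar and the exact-match set intersection are simplifying assumptions; the comparison above matches concepts more loosely (e.g. "Law" against "Governing Law"):

    import nltk

    # One-time model downloads:
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    def core_concepts(text):
        """Extract noun groups with a simple regex chunk grammar."""
        chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")  # adjectives + nouns
        tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(text)))
        return {" ".join(word.lower() for word, _ in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")}

    def harmonic_similarity(p1, p2):
        """H = 2AB/(A+B), where A and B are the two coverage fractions."""
        g1, g2 = core_concepts(p1), core_concepts(p2)
        if not g1 or not g2:
            return 0.0
        common = g1 & g2                  # exact match: a simplification
        a, b = len(common) / len(g1), len(common) / len(g2)
        return 2 * a * b / (a + b) if a + b else 0.0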

吹泡泡o 2024-12-16 04:47:12

I don't know whether there is a .NET implementation, but you can easily code this yourself.

You can use a reversed n-gram index (A), look up the n-grams of your search paragraph (B), and divide the number of common n-grams by the total number of n-grams (C). That gives you a similarity score between 0 and 1, against which you can set a threshold (and probably do other things as well).

(A) Create a reversed n-gram index: get all n-grams from the paragraphs you want to search through and store them in a db.

Example:

A small corpus consisting of the following (short) paragraphs:
{rain, spain, main, plain}

has the following 3-grams:
{#ra, rai, ain, in#, #sp, spa, pai, #ma, mai, #pl, pla, lai}

You get the following key-value pairs:
{(#ra, rain), (rai, rain), (ain, rain), (ain, spain), (ain, main), (ain, plain), (in#, rain), (in#, spain), (in#, main), (in#, plain), (#sp, spain), (spa, spain), (pai, spain), (#ma, main), (mai, main), (#pl, plain), (pla, plain), (lai, plain)}
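
A sketch of step (A) — in Python rather than .NET, purely as an illustration — with an in-memory dict standing in for the db:

    from collections import defaultdict

    def ngrams(word, n=3):
        """Character n-grams of a word, padded with '#' at both ends."""
        padded = "#" + word + "#"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def build_index(corpus, n=3):
        """Reversed n-gram index: n-gram -> set of entries containing it."""
        index = defaultdict(set)
        for entry in corpus:
            for gram in ngrams(entry, n):
                index[gram].add(entry)
        return index

    index = build_index(["rain", "spain", "main", "plain"])
    print(sorted(index["ain"]))   # ['main', 'plain', 'rain', 'spain']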

(B) When looking up a paragraph to match against the corpus, calculate all its n-grams and look up each one in the reversed index.

Example:

Look up "pain" in the db we just created. We get the following n-grams:
{#pa, pai, ain, in#}

Here are the matching entries:

#pa -> no match

pai -> {spain}

ain -> {rain, spain, main, plain}

in# -> {rain, spain, main, plain}
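
Step (B) as code, reusing ngrams and index from the step (A) sketch:

    # Look up each 3-gram of "pain" in the reversed index.
    for gram in ngrams("pain"):
        matches = index.get(gram, set())
        print(gram, "->", sorted(matches) if matches else "no match")
    # #pa -> no match
    # pai -> ['spain']
    # ain -> ['main', 'plain', 'rain', 'spain']
    # in# -> ['main', 'plain', 'rain', 'spain']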

(C) Calculate a score for each entry you find by dividing the number of n-grams it shares with the search paragraph by the size of the union of both paragraphs' n-gram sets (i.e. the Jaccard index).

Example:

pain vs. spain: common n-grams {pai, ain, in#}; union of n-grams {#pa, pai, ain, in#, #sp, spa}; score: 3/6 (0.5)

pain vs. rain: common n-grams {ain, in#}; union of n-grams {#pa, pai, ain, in#, #ra, rai}; score: 2/6 (≈0.33)

pain vs. main: common n-grams {ain, in#}; union of n-grams {#pa, pai, ain, in#, #ma, mai}; score: 2/6 (≈0.33)

pain vs. plain: common n-grams {ain, in#}; union of n-grams {#pa, pai, ain, in#, #pl, pla, lai}; score: 2/7 (≈0.29)
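
Step (C) as code, again reusing the ngrams helper from the step (A) sketch; the score is the Jaccard index of the two n-gram sets, and the output matches the example above:

    def score(query, candidate, n=3):
        """Common n-grams divided by the union of both n-gram sets."""
        q, c = set(ngrams(query, n)), set(ngrams(candidate, n))
        return len(q & c) / len(q | c)

    for candidate in ["spain", "rain", "main", "plain"]:
        print(f"pain vs. {candidate}: {score('pain', candidate):.2f}")
    # pain vs. spain: 0.50
    # pain vs. rain: 0.33
    # pain vs. main: 0.33
    # pain vs. plain: 0.29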

Implementation details:

  • Make the n in n-grams configurable both in search and storage; this gives you the flexibility to experiment with 3-grams, 4-grams, 5-grams, ...
  • Use a fast (in-memory/in-process) db, such as SQLite or BerkeleyDb
  • Make sure duplicate keys are allowed (BerkeleyDb allows this, but you need a workaround for duplicate key-value pairs: for example, "maintain" contains the 3-gram ain twice, so technically you cannot store both of its (ain, maintain) pairs; see the sketch below)
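
A minimal storage sketch using Python's built-in sqlite3 module (my choice for illustration; the list above only names SQLite and BerkeleyDb as candidates). A plain table with no uniqueness constraint sidesteps the duplicate-pair restriction, and grouping at query time recovers the counts:

    import sqlite3

    con = sqlite3.connect(":memory:")   # in-process db, as suggested above
    con.execute("CREATE TABLE grams (gram TEXT, word TEXT)")  # duplicates OK
    con.execute("CREATE INDEX idx_gram ON grams (gram)")

    def ngrams(word, n=3):
        padded = "#" + word + "#"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    # "maintain" yields the 3-gram 'ain' twice; plain rows keep both copies.
    for w in ["rain", "spain", "main", "plain", "maintain"]:
        con.executemany("INSERT INTO grams VALUES (?, ?)",
                        [(g, w) for g in ngrams(w)])

    for row in con.execute("SELECT word, COUNT(*) FROM grams "
                           "WHERE gram = 'ain' GROUP BY word"):
        print(row)   # ('maintain', 2); every other word appears once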

Happy coding

血之狂魔 2024-12-16 04:47:12

I suggest you use the same algorithm Google uses to find duplicate documents on the web:
http://www.cs.sunysb.edu/~cse692/papers/henzinger_sigir06.pdf

Hash the phrases using Rabin's fingerprinting algorithm, sort these hashes, and compare the bottom 10. Very fast.
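
A rough Python sketch of the idea. hashlib.md5 stands in for Rabin fingerprints (which are faster, but not in the standard library), and the 4-word shingle length is an arbitrary choice; keeping only the bottom 10 hashes per document is the sketching trick described above:

    import hashlib

    def bottom_sketch(text, shingle_len=4, k=10):
        """Hash every shingle_len-word shingle; keep the k smallest hashes."""
        words = text.lower().split()
        shingles = {" ".join(words[i:i + shingle_len])
                    for i in range(max(1, len(words) - shingle_len + 1))}
        hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                        for s in shingles)
        return set(hashes[:k])

    def resemblance(doc_a, doc_b):
        """Sketch overlap approximates the documents' shingle-set similarity."""
        sa, sb = bottom_sketch(doc_a), bottom_sketch(doc_b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0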

Patrick.
