查找两组之间的邻近匹配对

发布于 2024-12-22 14:17:24 字数 226 浏览 1 评论 0原文

给定两组关键字,其中每个关键字都有开始和结束偏移量(例如关键字“abc”从偏移量 23 开始并在偏移量 25 结束),我想有效地找到这些组之间的匹配对。 匹配对是 set1 中的一个关键字和 set2 中的一个关键字,其中一个关键字在另一个关键字结束之后开始,但一个关键字的结尾到另一个关键字的开头之间不超过 MAX_PROXIMITY 个字符。此外,每个关键字只能属于一对(匹配的关键字不能重复用于其他匹配)。

Given two sets of keywords, where each keyword got start and end offset (e.g. keyword "abc" starts at offset 23 and ends at offset 25), I would like to efficiently find matching pairs between those sets.
a matched pair is a keyword from set1 and a keyword from set2, where one keyword starts after the other keyword ends, but no more than MAX_PROXIMITY characters between the end of the one to the start of the other. in addition, each keyword can belong only to one pair (matched keyword cannot be reused for another match).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

是你 2024-12-29 14:17:24

您可以将其表示为二分图中的最大匹配。将您拥有的两个集合视为两组顶点,并在从第一个集合的所有顶点到第二个集合中满足您的规则的所有顶点之间生成边,即“其中一个关键字在另一个关键字结束之后开始,但不超过 MAX_PROXIMITY一个的末尾到另一个的开头之间的字符”

一旦你有了图,就可以在二分图中运行最大匹配算法。
http://en.wikipedia.org/wiki/Matching_(graph_theory)#Maximum_matchings_in_bipartite_graphs

You could formulate it as maximum matching in a bipartite graph. Consider the two sets you have as two sets of vertices and generate edges between all the vertices from the first set to all the vertices in second set which satisfy your rule i.e. " where one keyword starts after the other keyword ends, but no more than MAX_PROXIMITY characters between the end of the one to the start of the other"

Once you have the graph in place run a maximum matching algorithm in a bipartite graph.
http://en.wikipedia.org/wiki/Matching_(graph_theory)#Maximum_matchings_in_bipartite_graphs

阳光下的泡沫是彩色的 2024-12-29 14:17:24

您可以使用动态规划来解决这个问题。

假设您有每组关键字,按它们开始的偏移量排序。
让我们定义

set1_keyword[i] # i-th keyword in the first set (ordered by the start offset)
set2_keyword[j] # same for the second set
max_pairs[i][j] # number of pairs in optimal assignment between keywords 1..i from the first set and keywords 1..j from the second set

当然 max_pairs[n1][n2] 是您问题的答案(其中 n1n2 是第一个和分别设置第二个关键字)

要计算 max_pairs 表,请使用以下公式

if (set1_keyword[i] matches set2_keyword[j])
    max_pairs[i][j] = max_pairs[i-1][j-1] + 1
else
    max_pairs[i][j] = max(max_pairs[i-1][j], max_pairs[i][j-1])

这基本上是说,如果您可以匹配关键字,则可以这样做,因为对于关键字 1..i 和 1..j 的问题,它是尽力而为 做。在另一种情况下(第 i 个和第 j 个关键字不匹配),您无法找到将第 i 个和第 j 个关键字与某些不同关键字配对的解决方案。因此,在最佳解决方案中,第 i 个关键字或第 j 个关键字应该是不配对的。
这基本上告诉我们查看问题的(已计算的)解决方案 max_pairs[i-1][j] (没有第 i 个关键字)或 max_pairs[i][j-1 ] (没有第 j 个关键字)并选择两者中较好的一个。

如果您以正确的顺序计算此表,即

for (int i = 0; i < n1; ++i)
    for (int j = 0; j < n2; ++j)
        # compute max_pairs[i][j] here

算法将具有 O(n1*n2) 复杂度,这比二部图中的分配问题(运行时间为 O(n^3))要好

有关动态规划的更多信息,请参阅动态规划

You can use dynamic programming to solve this problem.

Suppose you have each set of keywords ordered by the offset at which they start.
Let's define

set1_keyword[i] # i-th keyword in the first set (ordered by the start offset)
set2_keyword[j] # same for the second set
max_pairs[i][j] # number of pairs in optimal assignment between keywords 1..i from the first set and keywords 1..j from the second set

Of course max_pairs[n1][n2] is the answer to your problem (wheren1 and n2 are the sizes of the first and second keyword set respectively)

To compute max_pairs table use the following formula

if (set1_keyword[i] matches set2_keyword[j])
    max_pairs[i][j] = max_pairs[i-1][j-1] + 1
else
    max_pairs[i][j] = max(max_pairs[i-1][j], max_pairs[i][j-1])

This basically says that if you can match the keywords you do it, because for the problem with keywords 1..i and 1..j it's the best you can do. In the other case (i-th and j-th keywords don't match) you cannot have a solution in which both i-th and j-th keywords are paired to some different keywords. So in the optimal solution either i-th keyword or j-th keyword should be unpaired.
That basically tells us to look at the (already computed) solutions for problems max_pairs[i-1][j] (without i-th keyword) or max_pairs[i][j-1] (without j-th keyword) and choose the better of the two.

If you compute this table in the correct order i.e

for (int i = 0; i < n1; ++i)
    for (int j = 0; j < n2; ++j)
        # compute max_pairs[i][j] here

the algorithm will have O(n1*n2) complexity, which is better than the assignment problem in bipartite graph (which runs in O(n^3))

For more on dynamic programming please refer to dynamic programming

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文