查找两组之间的邻近匹配对
给定两组关键字,其中每个关键字都有开始和结束偏移量(例如关键字“abc”从偏移量 23 开始并在偏移量 25 结束),我想有效地找到这些组之间的匹配对。 匹配对是 set1 中的一个关键字和 set2 中的一个关键字,其中一个关键字在另一个关键字结束之后开始,但一个关键字的结尾到另一个关键字的开头之间不超过 MAX_PROXIMITY
个字符。此外,每个关键字只能属于一对(匹配的关键字不能重复用于其他匹配)。
Given two sets of keywords, where each keyword got start and end offset (e.g. keyword "abc" starts at offset 23 and ends at offset 25), I would like to efficiently find matching pairs between those sets.
a matched pair is a keyword from set1 and a keyword from set2, where one keyword starts after the other keyword ends, but no more than MAX_PROXIMITY
characters between the end of the one to the start of the other. in addition, each keyword can belong only to one pair (matched keyword cannot be reused for another match).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以将其表示为二分图中的最大匹配。将您拥有的两个集合视为两组顶点,并在从第一个集合的所有顶点到第二个集合中满足您的规则的所有顶点之间生成边,即“其中一个关键字在另一个关键字结束之后开始,但不超过 MAX_PROXIMITY一个的末尾到另一个的开头之间的字符”
一旦你有了图,就可以在二分图中运行最大匹配算法。
http://en.wikipedia.org/wiki/Matching_(graph_theory)#Maximum_matchings_in_bipartite_graphs
You could formulate it as maximum matching in a bipartite graph. Consider the two sets you have as two sets of vertices and generate edges between all the vertices from the first set to all the vertices in second set which satisfy your rule i.e. " where one keyword starts after the other keyword ends, but no more than MAX_PROXIMITY characters between the end of the one to the start of the other"
Once you have the graph in place run a maximum matching algorithm in a bipartite graph.
http://en.wikipedia.org/wiki/Matching_(graph_theory)#Maximum_matchings_in_bipartite_graphs
您可以使用动态规划来解决这个问题。
假设您有每组关键字,按它们开始的偏移量排序。
让我们定义
当然
max_pairs[n1][n2]
是您问题的答案(其中n1
和n2
是第一个和分别设置第二个关键字)要计算
max_pairs
表,请使用以下公式这基本上是说,如果您可以匹配关键字,则可以这样做,因为对于关键字 1..i 和 1..j 的问题,它是尽力而为 做。在另一种情况下(第 i 个和第 j 个关键字不匹配),您无法找到将第 i 个和第 j 个关键字与某些不同关键字配对的解决方案。因此,在最佳解决方案中,第 i 个关键字或第 j 个关键字应该是不配对的。
这基本上告诉我们查看问题的(已计算的)解决方案
max_pairs[i-1][j]
(没有第 i 个关键字)或max_pairs[i][j-1 ]
(没有第 j 个关键字)并选择两者中较好的一个。如果您以正确的顺序计算此表,即
算法将具有 O(n1*n2) 复杂度,这比二部图中的分配问题(运行时间为 O(n^3))要好
有关动态规划的更多信息,请参阅动态规划
You can use dynamic programming to solve this problem.
Suppose you have each set of keywords ordered by the offset at which they start.
Let's define
Of course
max_pairs[n1][n2]
is the answer to your problem (wheren1
andn2
are the sizes of the first and second keyword set respectively)To compute
max_pairs
table use the following formulaThis basically says that if you can match the keywords you do it, because for the problem with keywords 1..i and 1..j it's the best you can do. In the other case (i-th and j-th keywords don't match) you cannot have a solution in which both i-th and j-th keywords are paired to some different keywords. So in the optimal solution either i-th keyword or j-th keyword should be unpaired.
That basically tells us to look at the (already computed) solutions for problems
max_pairs[i-1][j]
(without i-th keyword) ormax_pairs[i][j-1]
(without j-th keyword) and choose the better of the two.If you compute this table in the correct order i.e
the algorithm will have O(n1*n2) complexity, which is better than the assignment problem in bipartite graph (which runs in O(n^3))
For more on dynamic programming please refer to dynamic programming