同义词查找算法
我认为示例会比冗长的描述好得多:)
让我们假设我们有一个数组数组:
("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")
每一行都包含同义词字符串。作为这个数组的处理结果,我想得到这个:
("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")
所以我认为我需要一种递归算法。编程语言实际上并不重要——一般来说,我只需要一点想法方面的帮助。我将使用 php 或 python。
谢谢你!
I think example will be much better than loooong description :)
Let's assume we have an array of arrays:
("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")
Each line contains strings which are synonyms. And as a result of processing of this array I want to get this:
("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")
So I think I need a kind of recursive algorithm. Programming language actually doesn't matter — I need only a little help with idea in general. I'm going to use php or python.
Thank you!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
这个问题可以简化为图论中的问题,您可以在图中找到所有连接节点组。
解决这个问题的一个有效方法是使用“洪水填充”算法,这本质上是一种递归呼吸优先搜索。这个维基百科条目描述了洪水填充算法以及它如何应用于解决寻找连通区域的问题一个图表。
要了解如何将原始问题变成图表上的问题:将每个条目(例如“Server1”、“Server_1”等)设为图表上的一个节点。当且仅当节点与边是同义词时,才将它们连接起来。如果您有足够的内存,矩阵数据结构特别适合跟踪边缘。否则,像地图这样的稀疏数据结构将起作用,特别是因为同义词的数量可能会受到限制。
那么edge[0][1] = edge[1][0] = 1,表示节点#0 和#1 之间有一条边(这意味着它们是同义词)。而edge[0][2] = edge[2][0] = 0,表明Server1和Server_2不是同义词。
复杂性分析
创建此数据结构非常高效,因为查找字符串到节点编号的映射的单个线性传递就足以创建它。如果将字符串到节点编号的映射存储在字典中,那么这将是一个 O(n log n) 步骤。
进行洪水填充的时间复杂度为 O(n),您只需访问图中的每个节点一次。所以,整个算法的复杂度是 O(n log n)。
This problem can be reduced to a problem in graph theory where you find all groups of connected nodes in a graph.
An efficient way to solve this problem is doing a "flood fill" algorithm, which is essentially a recursive breath first search. This wikipedia entry describes the flood fill algorithm and how it applies to solving the problem of finding connected regions of a graph.
To see how the original question can be made into a question on graphs: make each entry (e.g. "Server1", "Server_1", etc.) a node on a graph. Connect nodes with edges if and only if they are synonyms. A matrix data structure is particularly appropriate for keeping track of the edges, provided you have enough memory. Otherwise a sparse data structure like a map will work, especially since the number of synonyms will likely be limited.
Then edge[0][1] = edge[1][0] = 1, indicated that there is an edge between nodes #0 and #1 ( which means that they are synonyms ). While edge[0][2] = edge[2][0] = 0, indicating that Server1 and Server_2 are not synonyms.
Complexity Analysis
Creating this data structure is pretty efficient because a single linear pass with a lookup of the mapping of strings to node numbers is enough to crate it. If you store the mapping of strings to node numbers in a dictionary then this would be a O(n log n) step.
Doing the flood fill is O(n), you only visit each node in the graph once. So, the algorithm in all is O(n log n).
引入整数标记,表示同义词组。在开始时,用不同的标记标记从
1
到N
的所有单词。然后搜索您的集合,如果您发现索引
i
和j
的两个单词是同义词,则用标记i
和 < code>j 两者的数量都较少。经过 N 次迭代后,您将获得所有同义词组。这是一种肮脏且不完全有效的解决方案,我相信可以通过联合查找结构获得更多性能。
Introduce integer marking, which indicates synonym groups. On start one marks all words with different marks from
1
toN
.Then search trough your collection and if you find two words with indexes
i
andj
are synonym, then remark all of words with markingi
andj
with lesser number of both. AfterN
iteration you get all groups of synonyms.It is some dirty and not throughly efficient solution, I believe one can get more performance with union-find structures.
编辑:这可能不是解决您的问题的最有效方法。如果您对最大性能感兴趣(例如,如果您有数百万个值),您可能有兴趣编写更复杂的算法。
PHP,似乎正在工作(至少对于给定示例中的数据):
输出:
Edit: This probably is NOT the most efficient way of solving your problem. If you are interested in max performance (e.g., if you have millions of values), you might be interested in writing more complex algorithm.
PHP, seems to be working (at least with data from given example):
Output:
这将产生比 PHP 示例 (Python 3) 更低的复杂性:
This would yield lower complexity then the PHP example (Python 3):
我一直在寻找 python 的解决方案,所以我想出了这个解决方案。如果你愿意使用像集合这样的Python数据结构
你也可以使用这个解决方案。 “它是如此简单,穴居人都可以使用它。”
简单来说,这就是其背后的逻辑。
这将产生以下结果
I was looking for a solution in python, so I came up with this solution. If you are willing to use python data structures like sets
you can use this solution too. "It's so simple a cave man can use it."
Simply this is the logic behind it.
This will have a result of