同义词查找算法

发布于 2024-11-10 09:23:19 字数 527 浏览 0 评论 0原文

我认为示例会比冗长的描述好得多:)

让我们假设我们有一个数组数组:

("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")

每一行都包含同义词字符串。作为这个数组的处理结果,我想得到这个:

("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")

所以我认为我需要一种递归算法。编程语言实际上并不重要——一般来说,我只需要一点想法方面的帮助。我将使用 php 或 python。

谢谢你!

I think example will be much better than loooong description :)

Let's assume we have an array of arrays:

("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")

Each line contains strings which are synonyms. And as a result of processing of this array I want to get this:

("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")

So I think I need a kind of recursive algorithm. Programming language actually doesn't matter — I need only a little help with idea in general. I'm going to use php or python.

Thank you!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

初见你 2024-11-17 09:23:19

这个问题可以简化为图论中的问题,您可以在图中找到所有连接节点组。

解决这个问题的一个有效方法是使用“洪水填充”算法,这本质上是一种递归呼吸优先搜索。这个维基百科条目描述了洪水填充算法以及它如何应用于解决寻找连通区域的问题一个图表。

要了解如何将原始问题变成图表上的问题:将每个条目(例如“Server1”、“Server_1”等)设为图表上的一个节点。当且仅当节点与边是同义词时,才将它们连接起来。如果您有足够的内存,矩阵数据结构特别适合跟踪边缘。否则,像地图这样的稀疏数据结构将起作用,特别是因为同义词的数量可能会受到限制。

  • Server1 是节点#0
  • Server_1 是节点#1
  • Server_2 是节点#2

那么edge[0][1] = edge[1][0] = 1,表示节点#0 和#1 之间有一条边(这意味着它们是同义词)。而edge[0][2] = edge[2][0] = 0,表明Server1和Server_2不是同义词。

复杂性分析

创建此数据结构非常高效,因为查找字符串到节点编号的映射的单个线性传递就足以创建它。如果将字符串到节点编号的映射存储在字典中,那么这将是一个 O(n log n) 步骤。

进行洪水填充的时间复杂度为 O(n),您只需访问图中的每个节点一次。所以,整个算法的复杂度是 O(n log n)。

This problem can be reduced to a problem in graph theory where you find all groups of connected nodes in a graph.

An efficient way to solve this problem is doing a "flood fill" algorithm, which is essentially a recursive breath first search. This wikipedia entry describes the flood fill algorithm and how it applies to solving the problem of finding connected regions of a graph.

To see how the original question can be made into a question on graphs: make each entry (e.g. "Server1", "Server_1", etc.) a node on a graph. Connect nodes with edges if and only if they are synonyms. A matrix data structure is particularly appropriate for keeping track of the edges, provided you have enough memory. Otherwise a sparse data structure like a map will work, especially since the number of synonyms will likely be limited.

  • Server1 is Node #0
  • Server_1 is Node #1
  • Server_2 is Node #2

Then edge[0][1] = edge[1][0] = 1, indicated that there is an edge between nodes #0 and #1 ( which means that they are synonyms ). While edge[0][2] = edge[2][0] = 0, indicating that Server1 and Server_2 are not synonyms.

Complexity Analysis

Creating this data structure is pretty efficient because a single linear pass with a lookup of the mapping of strings to node numbers is enough to crate it. If you store the mapping of strings to node numbers in a dictionary then this would be a O(n log n) step.

Doing the flood fill is O(n), you only visit each node in the graph once. So, the algorithm in all is O(n log n).

生生漫 2024-11-17 09:23:19

引入整数标记,表示同义词组。在开始时,用不同的标记标记从 1N 的所有单词。

然后搜索您的集合,如果您发现索引 ij 的两个单词是同义词,则用标记 i 和 < code>j 两者的数量都较少。经过 N 次迭代后,您将获得所有同义词组。

这是一种肮脏且不完全有效的解决方案,我相信可以通过联合查找结构获得更多性能。

Introduce integer marking, which indicates synonym groups. On start one marks all words with different marks from 1 to N.

Then search trough your collection and if you find two words with indexes i and j are synonym, then remark all of words with marking i and j with lesser number of both. After N iteration you get all groups of synonyms.

It is some dirty and not throughly efficient solution, I believe one can get more performance with union-find structures.

冬天旳寂寞 2024-11-17 09:23:19

编辑:这可能不是解决您的问题的最有效方法。如果您对最大性能感兴趣(例如,如果您有数百万个值),您可能有兴趣编写更复杂的算法。


PHP,似乎正在工作(至少对于给定示例中的数据):

$data = array(
    array("Server1", "Server_1", "Main Server", "192.168.0.3"),
    array("Server_1", "VIP Server", "Main Server"),
    array("Server_2", "192.168.0.4"),
    array("192.168.0.3", "192.168.0.5"),
    array("Server_2", "Backup"),
);

do {
    $foundSynonyms = false;
    foreach ( $data as $firstKey => $firstValue ) {
        foreach ( $data as $secondKey => $secondValue ) {
            if ( $firstKey === $secondKey ) {
                continue;
            }
            if ( array_intersect($firstValue, $secondValue) ) {
                $data[$firstKey] = array_unique(array_merge($firstValue, $secondValue));
                unset($data[$secondKey]);
                $foundSynonyms = true;
                break 2; // outer foreach
            }
        }
    }
} while ( $foundSynonyms );

print_r($data);

输出:

Array
(
    [0] => Array
        (
            [0] => Server1
            [1] => Server_1
            [2] => Main Server
            [3] => 192.168.0.3
            [4] => VIP Server
            [6] => 192.168.0.5
        )

    [2] => Array
        (
            [0] => Server_2
            [1] => 192.168.0.4
            [3] => Backup
        )

)

Edit: This probably is NOT the most efficient way of solving your problem. If you are interested in max performance (e.g., if you have millions of values), you might be interested in writing more complex algorithm.


PHP, seems to be working (at least with data from given example):

$data = array(
    array("Server1", "Server_1", "Main Server", "192.168.0.3"),
    array("Server_1", "VIP Server", "Main Server"),
    array("Server_2", "192.168.0.4"),
    array("192.168.0.3", "192.168.0.5"),
    array("Server_2", "Backup"),
);

do {
    $foundSynonyms = false;
    foreach ( $data as $firstKey => $firstValue ) {
        foreach ( $data as $secondKey => $secondValue ) {
            if ( $firstKey === $secondKey ) {
                continue;
            }
            if ( array_intersect($firstValue, $secondValue) ) {
                $data[$firstKey] = array_unique(array_merge($firstValue, $secondValue));
                unset($data[$secondKey]);
                $foundSynonyms = true;
                break 2; // outer foreach
            }
        }
    }
} while ( $foundSynonyms );

print_r($data);

Output:

Array
(
    [0] => Array
        (
            [0] => Server1
            [1] => Server_1
            [2] => Main Server
            [3] => 192.168.0.3
            [4] => VIP Server
            [6] => 192.168.0.5
        )

    [2] => Array
        (
            [0] => Server_2
            [1] => 192.168.0.4
            [3] => Backup
        )

)
音盲 2024-11-17 09:23:19

这将产生比 PHP 示例 (Python 3) 更低的复杂性:

a = [set(("Server1", "Server_1", "Main Server", "192.168.0.3")),
    set(("Server_1", "VIP Server", "Main Server")),
    set(("Server_2", "192.168.0.4")),
    set(("192.168.0.3", "192.168.0.5")),
    set(("Server_2", "Backup"))]

b = {}
c = set()
for s in a:
    full_s = s.copy()
    for d in s:
        if b.get(d):
            full_s.update(b[d])
    for d in full_s:
        b[d] = full_s
    c.add(frozenset(full_s))

for k,v in b.items():
    fsv = frozenset(v)
    if fsv in c:
        print(list(fsv))
        c.remove(fsv)

This would yield lower complexity then the PHP example (Python 3):

a = [set(("Server1", "Server_1", "Main Server", "192.168.0.3")),
    set(("Server_1", "VIP Server", "Main Server")),
    set(("Server_2", "192.168.0.4")),
    set(("192.168.0.3", "192.168.0.5")),
    set(("Server_2", "Backup"))]

b = {}
c = set()
for s in a:
    full_s = s.copy()
    for d in s:
        if b.get(d):
            full_s.update(b[d])
    for d in full_s:
        b[d] = full_s
    c.add(frozenset(full_s))

for k,v in b.items():
    fsv = frozenset(v)
    if fsv in c:
        print(list(fsv))
        c.remove(fsv)
日裸衫吸 2024-11-17 09:23:19

我一直在寻找 python 的解决方案,所以我想出了这个解决方案。如果你愿意使用像集合这样的Python数据结构
你也可以使用这个解决方案。 “它是如此简单,穴居人都可以使用它。”

简单来说,这就是其背后的逻辑。

foreach set_of_values in value_collection:
    alreadyInSynonymSet = false
    foreach synonym_set in synonym_collection:
        if set_of_values in synonym_set:
            alreadyInSynonymSet = true
            synonym_set = synonym_set.union(set_of_values)
    if not alreadyInSynonymSet:
        synonym_collection.append(set(set_of_values))
vals = (
    ("Server1", "Server_1", "Main Server", "192.168.0.3"),
    ("Server_1", "VIP Server", "Main Server"),
    ("Server_2", "192.168.0.4"),
    ("192.168.0.3", "192.168.0.5"),
    ("Server_2", "Backup"),
)

value_sets = (set(value_tup) for value_tup in vals)

synonym_collection = []

for value_set in value_sets:
    isConnected = False # If connected to a term in the graph

    print(f'\nCurrent Value Set: {value_set}')
    
    for synonyms in synonym_collection:
        # IF two sets are disjoint, they don't have common elements
        if not set(synonyms).isdisjoint(value_set):
            isConnected = True
            synonyms |= value_set # Appending elements of new value_set to synonymous set
            break

    # If it's not related to any other term, create a new set
    if not isConnected:
        print ('Value set not in graph, adding to graph...')
        synonym_collection.append(value_set)

print('\nDone, Completed Graphing Synonyms')
print(synonym_collection)

这将产生以下结果

Current Value Set: {'Server1', 'Main Server', '192.168.0.3', 'Server_1'}
Value set not in graph, adding to graph...

Current Value Set: {'VIP Server', 'Main Server', 'Server_1'}

Current Value Set: {'192.168.0.4', 'Server_2'}
Value set not in graph, adding to graph...

Current Value Set: {'192.168.0.3', '192.168.0.5'}

Current Value Set: {'Server_2', 'Backup'}

Done, Completed Graphing Synonyms
[{'VIP Server', 'Main Server', '192.168.0.3', '192.168.0.5', 'Server1', 'Server_1'}, {'192.168.0.4', 'Server_2', 'Backup'}]

I was looking for a solution in python, so I came up with this solution. If you are willing to use python data structures like sets
you can use this solution too. "It's so simple a cave man can use it."

Simply this is the logic behind it.

foreach set_of_values in value_collection:
    alreadyInSynonymSet = false
    foreach synonym_set in synonym_collection:
        if set_of_values in synonym_set:
            alreadyInSynonymSet = true
            synonym_set = synonym_set.union(set_of_values)
    if not alreadyInSynonymSet:
        synonym_collection.append(set(set_of_values))
vals = (
    ("Server1", "Server_1", "Main Server", "192.168.0.3"),
    ("Server_1", "VIP Server", "Main Server"),
    ("Server_2", "192.168.0.4"),
    ("192.168.0.3", "192.168.0.5"),
    ("Server_2", "Backup"),
)

value_sets = (set(value_tup) for value_tup in vals)

synonym_collection = []

for value_set in value_sets:
    isConnected = False # If connected to a term in the graph

    print(f'\nCurrent Value Set: {value_set}')
    
    for synonyms in synonym_collection:
        # IF two sets are disjoint, they don't have common elements
        if not set(synonyms).isdisjoint(value_set):
            isConnected = True
            synonyms |= value_set # Appending elements of new value_set to synonymous set
            break

    # If it's not related to any other term, create a new set
    if not isConnected:
        print ('Value set not in graph, adding to graph...')
        synonym_collection.append(value_set)

print('\nDone, Completed Graphing Synonyms')
print(synonym_collection)

This will have a result of

Current Value Set: {'Server1', 'Main Server', '192.168.0.3', 'Server_1'}
Value set not in graph, adding to graph...

Current Value Set: {'VIP Server', 'Main Server', 'Server_1'}

Current Value Set: {'192.168.0.4', 'Server_2'}
Value set not in graph, adding to graph...

Current Value Set: {'192.168.0.3', '192.168.0.5'}

Current Value Set: {'Server_2', 'Backup'}

Done, Completed Graphing Synonyms
[{'VIP Server', 'Main Server', '192.168.0.3', '192.168.0.5', 'Server1', 'Server_1'}, {'192.168.0.4', 'Server_2', 'Backup'}]
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文