Using NetworkX to draw edges between nodes based on similarity?

Posted 2025-01-09 07:09:01

Here is my toy node DataFrame:

    import pandas as pd
    
    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'a': [55, 2123, -19.3, 9, -8], 
        'b': ['aa', 'bb', 'ad', 'kuku', 'lulu']
    })

I am building a Graph with the nodes (each row of the df is a node with id and attributes):

    import networkx as nx
    G = nx.Graph()
    
    for i, attr in df.set_index('id').iterrows():
        G.add_node(i, **attr.to_dict())

Now I want to connect these nodes based on node similarity (cosine or any other distance function). Questions:

1. Can I compute node similarity with mixed attribute types, applying a different distance metric to each type?
2. If my nodes' attributes are all numeric, how can I calculate the similarity between any two nodes in my graph and draw an edge between them when their similarity is above some threshold alpha?

For question 2, assume my df above is:

    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'a': [55, 2123, -19.3, 9, -8],
        'b': [21, -0.1, 0.003, 4, 2.1]
    })

风为裳 2025-01-16 07:09:01

AFAIK, networkx does not compute similarity between node attributes, so that will have to be calculated outside networkx.

For question 1, given the mixed data types, I can recommend recordlinkage. Using this library you can implement a logic for what combination of numeric/string variables is considered 'similar'.
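If pulling in recordlinkage is not an option, the same idea can be sketched as a small custom pipeline. The snippet below is only an illustration, not recordlinkage's API: it uses the standard library's `difflib.SequenceMatcher` for the string column and an absolute difference for the numeric column. The thresholds `MAX_NUM_DIST` and `MIN_STR_SIM` and the helper `is_similar` are hypothetical names and values picked for this toy data.

```python
from difflib import SequenceMatcher
from itertools import combinations

import networkx as nx
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "a": [55, 2123, -19.3, 9, -8],
        "b": ["aa", "bb", "ad", "kuku", "lulu"],
    }
)

# Hypothetical thresholds, chosen only for this toy data.
MAX_NUM_DIST = 100.0  # "a" values closer than this count as similar
MIN_STR_SIM = 0.4     # SequenceMatcher ratio at or above this counts as similar


def is_similar(attrs_u, attrs_v):
    """Illustrative rule: the numeric AND the string attribute must both be close."""
    num_close = abs(attrs_u["a"] - attrs_v["a"]) < MAX_NUM_DIST
    str_sim = SequenceMatcher(None, attrs_u["b"], attrs_v["b"]).ratio()
    return num_close and str_sim >= MIN_STR_SIM


# one attribute dict per node id
records = df.set_index("id").to_dict("index")

G = nx.Graph()
for node_id, attrs in records.items():
    G.add_node(node_id, **attrs)

# compare every unordered pair of nodes once
for u, v in combinations(records, 2):
    if is_similar(records[u], records[v]):
        G.add_edge(u, v)

print(sorted(G.edges()))
```

Requiring both conditions is just one possible rule; an "or", or a weighted score, would fit into `is_similar` the same way. Comparing all pairs is quadratic in the number of nodes, which is fine for a toy frame but not for large data.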

For question 2, if the data is all numeric, then sklearn's pairwise distances are appropriate. As of version 1.0.2 they do not support string dtype, so string variables need recordlinkage, another string processing library, or a custom pipeline. Something along these lines:

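    import networkx as nx
    import numpy as np
    import pandas as pd
    from sklearn.metrics import pairwise_distances

    df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5],
            "a": [55, 2123, -19.3, 9, -8],
            "b": ["aa", "bb", "ad", "kuku", "lulu"],
        }
    )

    # pairwise Euclidean distances on the numeric column only
    dist_a = pairwise_distances(df[["a"]], metric="euclidean")

    # form links if distance is lower than some threshold;
    # k=1 keeps only the upper triangle, so no self-loops or duplicate pairs
    ix_a, ix_b = np.where(np.triu(dist_a < 70, k=1))

    # add every node by id, then the edges (row positions mapped back to ids)
    ids = df["id"].tolist()
    G = nx.Graph()
    G.add_nodes_from(ids)
    for source, target in zip(ix_a, ix_b):
        G.add_edge(ids[source], ids[target])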

To handle multiple columns (and distances), one will need to add some logic for how to combine (and possibly weight/normalize) the different distances.
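As a rough sketch of one such combination for the all-numeric df from question 2: standardize each column so that `a` does not dominate, compute cosine similarity between the rows with plain NumPy, and add an edge whenever the similarity exceeds alpha. The value `ALPHA = 0.9` and the choice of z-score standardization are assumptions for illustration, not something the question prescribes.

```python
from itertools import combinations

import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "a": [55, 2123, -19.3, 9, -8],
        "b": [21, -0.1, 0.003, 4, 2.1],
    }
)

ALPHA = 0.9  # hypothetical similarity threshold

X = df[["a", "b"]].to_numpy(dtype=float)

# z-score each column so both attributes contribute on a comparable scale
X = (X - X.mean(axis=0)) / X.std(axis=0)

# cosine similarity: dot products of the rows scaled to unit length
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T

ids = df["id"].tolist()
G = nx.Graph()
G.add_nodes_from(ids)

# visit each unordered pair once; store the similarity as the edge weight
for i, j in combinations(range(len(ids)), 2):
    if sim[i, j] > ALPHA:
        G.add_edge(ids[i], ids[j], weight=float(sim[i, j]))

print(G.edges(data=True))
```

Different weights per column could be applied by multiplying the standardized columns by a weight vector before normalizing the rows. Note that cosine similarity is undefined for a row that is all zeros after standardization, which does not occur in this toy data.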
