nodevectors doesn't return all nodes

Posted 2025-01-09 14:59:34


I'm trying to use the nodevectors Node2Vec class to get an embedding for my graph. I can't show the entire code, but basically this is what I'm doing:

import networkx as nx
import pandas as pd
import nodevectors

# df is an edge list with 'customer', 'item' and 'weight' columns (not shown)
n2v = nodevectors.Node2Vec(n_components=128,
                           walklen=80,
                           epochs=3,
                           return_weight=1,
                           neighbor_weight=1,
                           threads=4)
G = nx.from_pandas_edgelist(df, 'customer', 'item', edge_attr='weight', create_using=nx.Graph)
n2v.fit(G)
model = n2v.model               # the underlying Gensim Word2Vec model
shape = model.wv.vectors.shape  # ends up with fewer rows than G has nodes

I know G has all the nodes in my scope. Then I fit the model, but model.wv.vectors has fewer rows than my number of nodes.
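
To see which nodes are affected, here is a minimal diagnostic sketch (assuming Gensim 4's key_to_index mapping, and assuming nodevectors stores node names as strings internally, which I believe it does):

missing = [n for n in G.nodes() if str(n) not in model.wv.key_to_index]
print(len(missing), "of", G.number_of_nodes(), "nodes have no vector")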

I haven't been able to figure out why the number of nodes represented in the model.wv.vectors embedding is lower than the actual number of nodes in G.

Does anyone know why this happens?


1 Answer

最美的太阳 2025-01-16 14:59:34


TL;DR: Your non-default epochs=3 can result in some nodes appearing only 3 times – but the inner Word2Vec model by default ignores tokens appearing fewer than 5 times. Upping to epochs=5 may be a quick fix - but read on for the reasons & tradeoffs with various defaults.

--

If you're using the nodevectors package described here, it seems to be built on Gensim's Word2Vec – which uses a default min_count=5.
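
As a tiny self-contained illustration (toy sentences made up here), Gensim's default min_count=5 silently drops rare tokens:

from gensim.models import Word2Vec

# 'a' and 'b' appear 5 times each; 'c' appears 4 times and 'd' once
sentences = [["a", "b", "c"]] * 4 + [["a", "b", "d"]]
m = Word2Vec(sentences, vector_size=8, min_count=5, seed=1)
print(m.wv.key_to_index)  # only 'a' and 'b' get vectors; 'c' and 'd' are ignored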

That means any tokens – in this case, nodes – which appear fewer than 5 times are ignored. Especially in the natural-language contexts where Word2Vec was pioneered, discarding such rare words entirely usually has multiple benefits:

  • from only a few idiosyncratic examples, such rare words themselves get peculiar vectors less-likely to generalize to downstream uses (other texts)
  • compared to other frequent words, each gets very little training effort overall, & thus provides only a little pushback on shared model weights (based on their peculiar examples) - so the vectors are weaker & retain more arbitrary influence from random-initialization & relative positioning in the corpus. (More-frequent words provide more varied, numerous examples to extract their unique meaning.)
  • because of the Zipfian distribution of word-frequencies in natural language, there are a lot of such low-frequency words – often even typos – and altogether they take up a lot of the model's memory & training-time. But they don't individually get very good vectors, or have generalizable beneficial influences on the shared model. So they wind up serving a lot like noise that weakens other vectors for more-frequent words, as well.

So typically in Word2Vec, discarding rare words only gives up low-value vectors while simultaneously speeding training, shrinking memory requirements, & improving the quality of the remaining vectors: a big win.

Although the distribution of node-names in graph random-walks may be very different from natural-language word-frequencies, some of the same concerns still apply for nodes that appear rarely. On the other hand, if a node truly only appears at the end of a long chain of nodes, every walk to or from it will include the exact same neighbors - and maybe extra appearances in more walks would add no new variety-of-information (at least within the inner Word2Vec window of analysis).

You may be able to confirm whether the default min_count is your issue by using the Node2Vec keep_walks parameter to store the generated walks, then checking: are the 'missing' nodes exactly the ones that appear fewer than min_count times in the walks?
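
A minimal sketch of that check (keep_walks is a documented Node2Vec parameter; I'm assuming the stored walks end up exposed on the fitted object as a .walks collection of node-name sequences):

from collections import Counter

n2v = nodevectors.Node2Vec(n_components=128, walklen=80, epochs=3,
                           threads=4, keep_walks=True)
n2v.fit(G)
counts = Counter(node for walk in n2v.walks for node in walk)
rare = [n for n, c in counts.items() if c < 5]  # below the default min_count
print(len(rare), "nodes appear fewer than 5 times across all walks")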

If so, a few options may be:

  • override min_count using the Node2Vec w2vparams option, to something like min_count=1 (see the sketch after this list). As noted above, this is almost always a bad idea in traditional natural-language Word2Vec - but maybe it's not so bad in a graph application, where for rare/outer-edge nodes one walk is enough, and then at least you get whatever strange/noisy vector results from that minimal training.
  • try to influence the walks to ensure all nodes appear enough times. I suppose some values of the Node2Vec walklen, return_weight, & neighbor_weight could improve coverage - but I don't think they can guarantee all nodes appear in at least N (say, 5, to match the default min_count) different walks. But it looks like the Node2Vec epochs parameter controls how many times every node is used as a starting point – so epochs=5 would guarantee every node appears at least 5 times, as the start of 5 separate walks. (Notably: the Node2Vec default is epochs=20 - which would never trigger a bad interaction with the internal Word2Vec min_count=5. But setting your non-default epochs=3 risks leaving some nodes with only 3 appearances.)
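
Putting those together, a hedged sketch of the quick fix (w2vparams and epochs are real Node2Vec arguments, but note that passing your own w2vparams dict may replace the library's default inner-Word2Vec settings wholesale, and the accepted keys depend on your Gensim version):

n2v = nodevectors.Node2Vec(n_components=128,
                           walklen=80,
                           epochs=5,  # every node now starts at least 5 walks
                           return_weight=1,
                           neighbor_weight=1,
                           threads=4,
                           w2vparams={"min_count": 1})  # belt-and-suspenders: keep even rare nodes
n2v.fit(G)
# hoped-for outcome: one vector per node
print(n2v.model.wv.vectors.shape[0], "vectors for", G.number_of_nodes(), "nodes")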