nodevectors not returning all nodes
I'm trying to use the `Node2Vec` class from `nodevectors` to get an embedding for my graph. I can't show the entire code, but this is basically what I'm doing:

```python
import networkx as nx
import pandas as pd
import nodevectors

n2v = nodevectors.Node2Vec(n_components=128,
                           walklen=80,
                           epochs=3,
                           return_weight=1,
                           neighbor_weight=1,
                           threads=4)

# df: edge list with 'customer', 'item' and 'weight' columns (not shown)
G = nx.from_pandas_edgelist(df, 'customer', 'item', edge_attr='weight', create_using=nx.Graph)
n2v.fit(G)
model = n2v.model
shape = model.wv.vectors.shape
```

I know `G` has all the nodes from my scope. Then I fit the model, but `model.wv.vectors` has fewer rows than I have nodes.

I haven't been able to find out why the number of nodes represented in my embedding by `model.wv.vectors` is lower than the actual number of nodes in `G`.

Does anyone know why this happens?
Comments (1)
TL;DR: Your non-default `epochs=3` can result in some nodes appearing only 3 times, but the inner `Word2Vec` model by default ignores tokens appearing fewer than 5 times. Raising it to `epochs=5` may be a quick fix, but read on for the reasons and the tradeoffs with various defaults.
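To make that interaction concrete, here is a toy sketch in plain Python. It does not use `nodevectors` or Gensim: `simulate_walks` and `surviving_nodes` are hypothetical helpers, the walks are unweighted and uniform, and the frequency cut-off is only a simplified stand-in for what `min_count` does. The walk length is kept deliberately short so the effect is visible on a tiny star graph.

```python
import random
from collections import Counter


def simulate_walks(adjacency, epochs, walklen, seed=0):
    """Start `epochs` uniform random walks of `walklen` nodes from every node.

    Assumption: this mirrors the idea that each epoch uses every node once
    as a starting point; it is not the library's actual walk generator.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(epochs):
        for start in adjacency:
            walk = [start]
            while len(walk) < walklen:
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks


def surviving_nodes(walks, min_count=5):
    """Keep only nodes seen at least `min_count` times across all walks."""
    counts = Counter(node for walk in walks for node in walk)
    return {node for node, n in counts.items() if n >= min_count}


# Toy star graph: one hub item connected to 50 customers.
leaves = [f"customer_{i}" for i in range(50)]
adjacency = {"item_hub": leaves, **{leaf: ["item_hub"] for leaf in leaves}}

for epochs in (3, 5):
    walks = simulate_walks(adjacency, epochs=epochs, walklen=2)
    kept = surviving_nodes(walks, min_count=5)
    print(f"epochs={epochs}: {len(kept)} of {len(adjacency)} nodes pass min_count=5")
```

With `epochs=5` every node starts 5 walks, so all of them reach the threshold. With `epochs=3` almost every leaf stays below it and would end up without a vector.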
If you're using the `nodevectors` package, it seems to be built on Gensim's `Word2Vec`, which uses a default `min_count=5`.

That means any tokens (in this case, nodes) which appear fewer than 5 times are ignored. Especially in the natural-language contexts where `Word2Vec` was pioneered, discarding such rare words entirely usually has multiple benefits.

So typically in `Word2Vec`, discarding rare words only gives up low-value vectors while simultaneously speeding training, shrinking memory requirements, and improving the quality of the remaining vectors: a big win.

Although the distribution of node names in graph random walks may be very different from natural-language word frequencies, some of the same concerns still apply for nodes that appear rarely. On the other hand, if a node truly only appears at the end of a long chain of nodes, every walk to or from it will include the exact same neighbors, and maybe extra appearances in more walks would add no new variety of information (at least within the inner `Word2Vec` `window` of analysis).

You may be able to confirm whether the default `min_count` is your issue by using the `Node2Vec` `keep_walks` parameter to store the generated walks, then checking: are the 'missing' nodes exactly the ones that appear fewer than `min_count` times in the walks?

If so, a few options may be:
- Override `min_count`, using the `Node2Vec` `w2vparams` option, to something like `min_count=1`. As noted above, this is always a bad idea in traditional natural-language `Word2Vec`, but maybe it's not so bad in a graph application, where for rare/outer-edge nodes one walk is enough, and then at least you have whatever strange/noisy vector results from that minimal training.
- Some values of the `Node2Vec` `walklen`, `return_weight`, and `neighbor_weight` could improve coverage, but I don't think they could guarantee that all nodes appear in at least N (say, 5, to match the default `min_count`) different walks. But it looks like the `Node2Vec` `epochs` parameter controls how many times every node is used as a starting point, so `epochs=5` would guarantee every node appears at least 5 times, as the start of 5 separate walks. (Notably: the `Node2Vec` default is `epochs=20`, which would never trigger a bad interaction with the internal `Word2Vec` `min_count=5`. But setting your non-default `epochs=3` risks leaving some nodes with only 3 appearances.)
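The check suggested above (are the missing nodes exactly the ones seen fewer than `min_count` times?) could be sketched roughly as follows. `explain_missing` is a hypothetical helper, and the sketch assumes the stored walks can be iterated as sequences of node names that match the keys used by the embedding; the sample data is made up.

```python
from collections import Counter


def explain_missing(walks, all_nodes, embedded_nodes, min_count=5):
    """Compare nodes without a vector against nodes that are rare in the walks."""
    counts = Counter(node for walk in walks for node in walk)
    missing = set(all_nodes) - set(embedded_nodes)
    below = {node for node in all_nodes if counts[node] < min_count}
    return {
        "missing": missing,              # nodes in the graph but not embedded
        "below_min_count": below,        # nodes too rare in the walks
        "explained": missing == below,   # True if min_count accounts for the gap
    }


# Made-up example: node "c" appears once, so it falls below min_count=2.
walks = [["a", "b", "a", "b"], ["b", "a", "b", "a"], ["c", "b", "a", "b"]]
all_nodes = ["a", "b", "c"]
embedded_nodes = ["a", "b"]

report = explain_missing(walks, all_nodes, embedded_nodes, min_count=2)
print(report)
```

In real use, `walks` would be whatever your version of `nodevectors` keeps when `keep_walks` is enabled, `all_nodes` would come from `G.nodes`, and `embedded_nodes` would be the vocabulary of the fitted model (in Gensim 4, `model.wv.index_to_key`). Node names may need converting to a common type, for example strings, before comparing.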