Using Gensim with an existing embedding table

Posted 2025-02-09 03:06:07


I have generated a bunch of word embeddings using a fine-tuned transformer model from the Hugging Face Transformers library.
Now I would like to do some quick evaluation of whether the results are any good. I stumbled upon Gensim and saw that it has handy functions such as model.wv.most_similar(), and probably a few others that I could use down the line.

I was wondering whether, instead of loading a Gensim model, I could somehow import my embedding table and have Gensim use that, so I don't have to implement all of those functions on my own.

My embeddings are currently in a dictionary whose key/value pairs are words and their embedding vectors, though I could reasonably save them in any other format.


Comments (1)

秋千易 2025-02-16 03:06:07


Did some digging and found this article: https://www.kaggle.com/code/matsuik/convert-embedding-dictionary-to-gensim-w2v-format/notebook

With the method:

import gensim
import numpy as np
from tqdm import tqdm

def save_word2vec_format(fname, vocab, vector_size, binary=True):
    """Store the embedding table in the same format used by the original
    C word2vec tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path to save the vectors to.
    vocab : dict
        Mapping from each word to its embedding vector.
    vector_size : int
        The number of dimensions of the word vectors.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format,
        else it will be saved in plain text.
    """
    total_vec = len(vocab)
    with open(fname, 'wb') as fout:
        # header line: "<vocab_size> <vector_size>"
        fout.write(gensim.utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        for word, row in tqdm(vocab.items()):
            if binary:
                # tobytes() replaces the deprecated ndarray.tostring()
                row = row.astype(np.float32)
                fout.write(gensim.utils.to_utf8(word) + b" " + row.tobytes())
            else:
                fout.write(gensim.utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))

and

model = gensim.models.KeyedVectors.load_word2vec_format('./my/path', binary=True)

It needed some small modifications, but it seems to work just fine for my use case.
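For reference, the plain-text variant of the format (what `load_word2vec_format(..., binary=False)` parses) is simple enough to round-trip by hand: a header line `"<vocab_size> <vector_size>"`, then one `"word v1 v2 ..."` line per word. This sketch does that in pure NumPy, with no Gensim dependency; the function names here are just for illustration:

```python
import os
import tempfile
import numpy as np

def write_word2vec_text(path, embeddings):
    # header, then one whitespace-separated line per word
    dim = len(next(iter(embeddings.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(embeddings)} {dim}\n")
        for word, vec in embeddings.items():
            f.write(word + " " + " ".join(repr(float(v)) for v in vec) + "\n")

def read_word2vec_text(path):
    # parse the header, then rebuild the {word: vector} dict
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())
        table = {}
        for line in f:
            word, *values = line.split()
            table[word] = np.asarray(values, dtype=np.float32)
    assert len(table) == n_words and all(len(v) == dim for v in table.values())
    return table

vocab = {"cat": np.array([0.1, 0.2]), "dog": np.array([0.3, 0.4])}
path = os.path.join(tempfile.gettempdir(), "vecs.txt")
write_word2vec_text(path, vocab)
restored = read_word2vec_text(path)
```

Note that this simple layout assumes words contain no whitespace, which is why the binary variant above writes the raw float32 bytes immediately after the word instead.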
