Is there a faster way to get word embeddings from given subword embeddings in BERT?

Posted on 2025-01-20 06:13:56


Using bert.tokenizer I can get the subword ids and the word spans of the words in a sentence. For example, given the sentence "This is an example", I get the encoded_text embeddings of ["th","##is","an","exam","##ple"] and the word_spans list [[0,2],[2,3],[3,5]], where each span is a [start, end) range of subword indices.
My implementation is:

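import torch

# encoded_text: (num_subwords, 768) subword embeddings; one output row per word
word_embeddings = torch.empty(len(word_spans), 768, device='cuda')
for seq, (start, end) in enumerate(word_spans):
    word_embeddings[seq] = encoded_text[start:end].mean(dim=0)  # average the word's subwords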

Is there a faster way to combine the vectors of all subwords of the same word in PyTorch?
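
One possible way to avoid the Python loop over words, given here as a rough sketch rather than a benchmarked answer: map every subword position to the index of its word, sum the subword vectors per word with `index_add_`, and divide by the span lengths. The helper name `pool_subwords` is hypothetical, and the sketch assumes that `word_spans` holds sorted, contiguous, non-overlapping, non-empty [start, end) ranges, as in the example above. Whether it is actually faster depends on the sentence length and the device, so it is worth timing on real data.

```python
import torch

def pool_subwords(encoded_text, word_spans):
    """Mean-pool subword vectors into word vectors without looping over words.

    Assumes word_spans are sorted, contiguous, non-overlapping, non-empty
    [start, end) ranges (hypothetical helper, not part of any library).
    """
    device = encoded_text.device
    spans = torch.as_tensor(word_spans, device=device)      # (num_words, 2)
    lengths = spans[:, 1] - spans[:, 0]                      # subwords per word
    num_words = lengths.size(0)
    # Word index of every subword position, e.g. [0, 0, 1, 2, 2].
    word_ids = torch.repeat_interleave(
        torch.arange(num_words, device=device), lengths
    )
    # Only the rows covered by the spans (skips e.g. leading special tokens).
    first, last = word_spans[0][0], word_spans[-1][1]
    covered = encoded_text[first:last]
    sums = torch.zeros(
        num_words, encoded_text.size(1), dtype=encoded_text.dtype, device=device
    )
    sums.index_add_(0, word_ids, covered)                    # per-word sums
    return sums / lengths.unsqueeze(1)                       # per-word means

# Usage with the spans from the question (random data on the CPU for illustration).
encoded_text = torch.rand(5, 768)
word_spans = [[0, 2], [2, 3], [3, 5]]
word_embeddings = pool_subwords(encoded_text, word_spans)
print(word_embeddings.shape)
```

The same idea extends to a batch of sentences by offsetting the word indices per sentence, at the cost of some extra bookkeeping.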
