如何知道一个词是否属于变压器模型？

发布于 2025-01-17 19:16:50 字数 291 浏览 4 评论 0原文

我使用模型 roberta 和 flaubert ，我使用python库 ston_transformer 。我使用余弦分数来计算相似性，但对某些话来说效果不佳。这些单词似乎是模型中“已知”单词的一部分（我猜我猜没有在训练集中的单词），例如：“ WCFS”，“ SARS”，“ OSGI”

是有一种方法可以检查字符串是否通过模型“知道”？（使用此库或任何其他能够加载这些变压器模型的人）

非常感谢。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

月牙弯弯 2025-01-24 19:16:50

对于Roberta和Flaubert模型，您可以使用get_vocab（）方法获取具有令牌及其ID的字典。词汇中的100个令牌的示例：

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100]

产量：

['<s>',
 '<pad>',
 '</s>',
 '<unk>',
 '.',
 'Ġthe',
 ',',
 'Ġto',
 'Ġand',
 'Ġof',
 'Ġa',
 'Ġin',
 '-',
 'Ġfor',
 'Ġthat',
 'Ġon',
 'Ġis',
 'âĢ',
 "'s",
 'Ġwith',
 'ĠThe',
 'Ġwas',
 'Ġ"',
 'Ġat',
 'Ġit',
 'Ġas',
 'Ġsaid',
 'Ļ',
 'Ġbe',
 's',
 'Ġby',
 'Ġfrom',
 'Ġare',
 'Ġhave',
 'Ġhas',
 ':',
 'Ġ(',
 'Ġhe',
 'ĠI',
 'Ġhis',
 'Ġwill',
 'Ġan',
 'Ġthis',
 ')',
 'ĠâĢ',
 'Ġnot',
 'Ŀ',
 'Ġyou',
 'ľ',
 'Ġtheir',
 'Ġor',
 'Ġthey',
 'Ġwe',
 'Ġbut',
 'Ġwho',
 'Ġmore',
 'Ġhad',
 'Ġbeen',
 'Ġwere',
 'Ġabout',
 ',"',
 'Ġwhich',
 'Ġup',
 'Ġits',
 'Ġcan',
 'Ġone',
 'Ġout',
 'Ġalso',
 'Ġ
然后，您可以在python中使用 operator中的检查令牌是否属于词汇：
"token" in vocab.keys()

,
 'Ġher',
 'Ġall',
 'Ġafter',
 '."',
 '/',
 'Ġwould',
 "'t",
 'Ġyear',
 'Ġwhen',
 'Ġfirst',
 'Ġshe',
 'Ġtwo',
 'Ġover',
 'Ġpeople',
 'ĠA',
 'Ġour',
 'ĠIt',
 'Ġtime',
 'Ġthan',
 'Ġinto',
 'Ġthere',
 't',
 'ĠHe',
 'Ġnew',
 'ĠâĢĶ',
 'Ġlast',
 'Ġjust',
 'ĠIn',
 'Ġother',
 'Ġso',
 'Ġwhat']

然后，您可以在python中使用 operator中的检查令牌是否属于词汇：

For RoBERTa and FlauBERT models, you can use get_vocab() method to get a dictionary with the tokens and theirs ids. Example of 100 tokens in vocab:

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100]

Yields:

['<s>',
 '<pad>',
 '</s>',
 '<unk>',
 '.',
 'Ġthe',
 ',',
 'Ġto',
 'Ġand',
 'Ġof',
 'Ġa',
 'Ġin',
 '-',
 'Ġfor',
 'Ġthat',
 'Ġon',
 'Ġis',
 'âĢ',
 "'s",
 'Ġwith',
 'ĠThe',
 'Ġwas',
 'Ġ"',
 'Ġat',
 'Ġit',
 'Ġas',
 'Ġsaid',
 'Ļ',
 'Ġbe',
 's',
 'Ġby',
 'Ġfrom',
 'Ġare',
 'Ġhave',
 'Ġhas',
 ':',
 'Ġ(',
 'Ġhe',
 'ĠI',
 'Ġhis',
 'Ġwill',
 'Ġan',
 'Ġthis',
 ')',
 'ĠâĢ',
 'Ġnot',
 'Ŀ',
 'Ġyou',
 'ľ',
 'Ġtheir',
 'Ġor',
 'Ġthey',
 'Ġwe',
 'Ġbut',
 'Ġwho',
 'Ġmore',
 'Ġhad',
 'Ġbeen',
 'Ġwere',
 'Ġabout',
 ',"',
 'Ġwhich',
 'Ġup',
 'Ġits',
 'Ġcan',
 'Ġone',
 'Ġout',
 'Ġalso',
 'Ġ
Then, you can use in operator in Python to check if a token belongs in the vocabulary:
"token" in vocab.keys()

,
 'Ġher',
 'Ġall',
 'Ġafter',
 '."',
 '/',
 'Ġwould',
 "'t",
 'Ġyear',
 'Ġwhen',
 'Ġfirst',
 'Ġshe',
 'Ġtwo',
 'Ġover',
 'Ġpeople',
 'ĠA',
 'Ġour',
 'ĠIt',
 'Ġtime',
 'Ġthan',
 'Ġinto',
 'Ġthere',
 't',
 'ĠHe',
 'Ġnew',
 'ĠâĢĶ',
 'Ġlast',
 'Ġjust',
 'ĠIn',
 'Ġother',
 'Ġso',
 'Ġwhat']

Then, you can use in operator in Python to check if a token belongs in the vocabulary:

回复收藏 0 原文

~没有更多了~