How to make a feature matrix without the encode_plus method in the tokenizer
I am working on a low-resource language and need to make a classifier.
I used the tokenizers library to train the following tokenizers: WLV, BPE, UNI, WPC. I have saved the result of each into a JSON file.
I load each of the tokenizers using the Tokenizer.from_file function:
tokenizer_WLV = Tokenizer.from_file('tokenizer_WLV.json')
and I can see it is loaded properly. However, only the encode method exists.
So if I do tokenizer_WLV.encode(s1), I get an output like
Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
and I can see each token along with its id as follows:
out_wlv = tokenizer_WLV.encode(s1)
print(out_wlv.ids)
print(out_wlv.tokens)
I can use encode_batch:
def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    # pad every sequence on the right with [PAD] (id 3) up to the longest sentence in the batch
    # note: max_seq_len is not used here
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]", direction='right')
    tokenized_sentences = tokenizer.encode_batch(sentences)
    return tokenized_sentences
which results in something like
[Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]
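Each Encoding in that list exposes the padded ids and the attention mask, so I can already stack the batch into an m x max_len array; a minimal sketch of that, assuming numpy and a couple of placeholder sentences:
import numpy as np

sentences = ["first example sentence", "second example sentence"]   # placeholder input
encodings = tokenize_sentences(sentences, tokenizer_WLV)

ids = np.array([enc.ids for enc in encodings])               # shape (m, padded_length)
mask = np.array([enc.attention_mask for enc in encodings])   # 1 for real tokens, 0 for [PAD]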
I need to make a feature matrix of size m x n, where m is the number of observations and n is the number of unique tokens. encode_plus does this automatically, so I am curious: what is the most efficient way to construct this feature matrix?
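For example, reusing the encodings and the numpy import from above, a naive bag-of-tokens count matrix could be built with a plain loop, using get_vocab_size() for n and skipping the [PAD] positions via the attention mask, but I doubt this is the most efficient way:
n = tokenizer_WLV.get_vocab_size()            # number of unique tokens in the trained vocabulary
m = len(encodings)                            # number of observations

X = np.zeros((m, n), dtype=np.int64)          # X[i, j] = count of token id j in sentence i
for i, enc in enumerate(encodings):
    for token_id, is_real in zip(enc.ids, enc.attention_mask):
        if is_real:                           # skip [PAD] positions
            X[i, token_id] += 1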
1 Answer
encode_plus is a method that the huggingface transformer tokenizers have (but it is already deprecated and should therefore be ignored). The alternative that both the huggingface tokenizers and the huggingface transformer tokenizers provide is __call__:
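A minimal sketch of how __call__ can be used here, assuming the trained tokenizer is wrapped in transformers' PreTrainedTokenizerFast and that "[PAD]" exists in the trained vocabulary (it is passed explicitly so that padding works):
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# wrap the trained tokenizers.Tokenizer so it gets the transformers __call__ interface
tokenizer_WLV = Tokenizer.from_file('tokenizer_WLV.json')
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer_WLV, pad_token="[PAD]")

sentences = ["first example sentence", "second example sentence"]    # placeholder input
batch = fast_tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="np")

print(batch["input_ids"].shape)         # (m, padded_length)
print(batch["attention_mask"].shape)    # same shape; 1 for real tokens, 0 for padding
The resulting input_ids and attention_mask arrays can then be reshaped into whatever m x n representation the classifier needs.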