How to build a feature matrix without the tokenizer's encode_plus method

Posted on 2025-02-10 03:43:50


I am working on a low-resource language and need to build a classifier.
I used the tokenizers library to train the following tokenizers: WLV, BPE, UNI, WPC. I saved each result to a JSON file.

I load each tokenizer with the Tokenizer.from_file function:

from tokenizers import Tokenizer

tokenizer_WLV = Tokenizer.from_file('tokenizer_WLV.json')

It loads properly. However, the loaded tokenizer has no encode_plus method, only encode.

So if I call tokenizer_WLV.encode(s1), I get output like

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

and I can see each token along with its id as follows:

out_wlv = tokenizer_WLV.encode(s1)
print(out_wlv.ids)
print(out_wlv.tokens)

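I can also use encode_batch:

def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    # Pad on the right so every encoding in the batch has the same length.
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]", direction='right')
    # Truncate any sentence longer than max_seq_len tokens.
    tokenizer.enable_truncation(max_length=max_seq_len)
    tokenized_sentences = tokenizer.encode_batch(sentences)
    return tokenized_sentences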

which results in something like

[Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]
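Since each Encoding in that list exposes ids and attention_mask, one way to get something close to what encode_plus returns is to collect those attributes row by row. The following is only a minimal sketch: encodings_to_rows is a hypothetical helper name, and the SimpleNamespace objects are stand-ins for real Encoding objects so the snippet runs without a trained tokenizer.

```python
from types import SimpleNamespace


def encodings_to_rows(encodings):
    """Collect ids and attention masks from Encoding-like objects.

    Hypothetical helper: it only assumes each item has `.ids` and
    `.attention_mask`, as the Encoding objects shown above do.
    """
    input_ids = [list(enc.ids) for enc in encodings]
    attention_mask = [list(enc.attention_mask) for enc in encodings]
    # With padding enabled, every row in a batch should have the same length.
    if len({len(row) for row in input_ids}) > 1:
        raise ValueError("rows differ in length; enable padding before encode_batch")
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Stand-ins for the objects returned by tokenizer.encode_batch(sentences).
fake_batch = [
    SimpleNamespace(ids=[5, 9, 3, 3], attention_mask=[1, 1, 0, 0]),
    SimpleNamespace(ids=[7, 8, 6, 4], attention_mask=[1, 1, 1, 1]),
]
features = encodings_to_rows(fake_batch)
print(features["input_ids"])
print(features["attention_mask"])
```

With a real tokenizer, the same function could be called on the return value of tokenize_sentences(sentences, tokenizer); the resulting lists have shape m x sequence length, not m x vocabulary size.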

I need to build a feature matrix of size m x n, where m is the number of observations and n is the number of unique tokens. encode_plus does this automatically, so I am curious: what is the most efficient way to construct this feature matrix?
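If the m x n matrix is read as a bag-of-tokens count matrix (an assumption, since the post does not say what the cells should hold), it can be sketched in plain Python from the token ids alone. token_count_matrix and its pad_id parameter are hypothetical names; the id lists below are made-up stand-ins for [enc.ids for enc in tokenizer.encode_batch(sentences)], and vocab_size would come from tokenizer.get_vocab_size().

```python
from collections import Counter


def token_count_matrix(batch_ids, vocab_size, pad_id=None):
    """Build an m x n count matrix: one row per sentence, one column per token id.

    Hypothetical helper. `batch_ids` is a list of id lists, `vocab_size` is n,
    and ids equal to `pad_id` are skipped so padding does not count as a feature.
    """
    matrix = []
    for ids in batch_ids:
        counts = Counter(i for i in ids if i != pad_id)
        row = [0] * vocab_size
        for token_id, count in counts.items():
            row[token_id] = count
        matrix.append(row)
    return matrix


# Made-up ids for two padded sentences; 3 is the [PAD] id used in the post.
batch_ids = [[5, 9, 5, 3], [7, 5, 3, 3]]
X = token_count_matrix(batch_ids, vocab_size=10, pad_id=3)
for row in X:
    print(row)
```

For a large vocabulary a dense list of lists wastes memory, so the same counts could instead be written into a sparse matrix; the dense version is kept here only to make the shape easy to inspect.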
