Training a new AutoTokenizer with Hugging Face

Posted on 2025-01-23 06:27:59

Getting this error: AttributeError: 'GPT2Tokenizer' object has no
attribute 'train_new_from_iterator'

This is very similar to the Hugging Face documentation example; I only changed the input, which shouldn't affect it. It worked once, but when I came back to it two hours later it no longer did, even though nothing had changed. The documentation states that train_new_from_iterator only works with 'fast' tokenizers and that AutoTokenizer is supposed to pick a 'fast' tokenizer by default, so my best guess is that this is where the trouble lies. I also tried downgrading transformers and reinstalling, with no success. df is just one column of text.

from transformers import AutoTokenizer
import pandas as pd

def batch_iterator(batch_size=10, size=5000):
    for i in range(100): #2264
        # cmx_uat is an existing database connection
        query = f"select note_text from cmx_uat.note where id > {i * size} limit 50;"
        df = pd.read_sql(sql=query, con=cmx_uat)

        for x in range(0, size, batch_size):
            yield list(df['note_text'].loc[0:5000])[x:x + batch_size]

old_tokenizer = AutoTokenizer.from_pretrained('roberta')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
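
One way to test the guess about fast tokenizers, before retraining anything, is to check the tokenizer's is_fast flag and request a fast tokenizer explicitly. A minimal diagnostic sketch, assuming the valid hub id 'roberta-base' (as used in the answer below) rather than 'roberta':

from transformers import AutoTokenizer

# use_fast=True explicitly requests the Rust-backed tokenizer
# (it is also the default, but being explicit rules out a silent fallback).
tokenizer = AutoTokenizer.from_pretrained('roberta-base', use_fast=True)

# Only fast tokenizers expose train_new_from_iterator; slow, pure-Python
# classes such as GPT2Tokenizer do not.
print(tokenizer.is_fast)                              # expect: True
print(hasattr(tokenizer, 'train_new_from_iterator'))  # expect: True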

Comments (1)

清醇 2025-01-30 06:27:59

There are two things to keep in mind:

First: train_new_from_iterator works with fast tokenizers only
(you can read more here).

Second: the training corpus should be a generator of batches of texts,
for instance a list of lists of texts if you have everything
in memory (official docs).
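
To illustrate the in-memory case from the second point, a plain list of lists of texts can be passed directly instead of a generator; a minimal sketch (the generator version follows below):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# One batch containing two texts; a list of lists satisfies the
# "batches of texts" requirement without writing a generator.
training_corpus = [['fghijk', 'wxyz']]
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)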

from transformers import AutoTokenizer
import pandas as pd

def batch_iterator(batch_size=3, size=8):
    # Toy corpus standing in for the database query in the question.
    df = pd.DataFrame({"note_text": ['fghijk', 'wxyz']})
    for x in range(0, size, batch_size):
        # Yield one batch of texts at a time.
        yield df['note_text'].to_list()[x:x + batch_size]

old_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
print(old_tokenizer(['fghijk', 'wxyz']))
print(new_tokenizer(['fghijk', 'wxyz']))

Output:

{'input_ids': [[0, 506, 4147, 18474, 2], [0, 605, 32027, 329, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
{'input_ids': [[0, 22, 2], [0, 21, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
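
As a follow-up, the retrained tokenizer can be saved and reloaded like any other; a minimal sketch, where the output directory name 'my-new-tokenizer' is arbitrary:

from transformers import AutoTokenizer

# Persist the retrained tokenizer to disk.
new_tokenizer.save_pretrained('my-new-tokenizer')

# Reload it later; the result is again a fast tokenizer.
reloaded = AutoTokenizer.from_pretrained('my-new-tokenizer')
print(reloaded.is_fast)  # expect: True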