Training a new AutoTokenizer with Hugging Face
I'm getting this error: AttributeError: 'GPT2Tokenizer' object has no attribute 'train_new_from_iterator'
My code is very similar to the Hugging Face documentation; I only changed the input, which shouldn't affect it. It worked once. I came back to it two hours later and it no longer works, even though nothing was changed. The documentation states that train_new_from_iterator only works with 'fast' tokenizers and that AutoTokenizer is supposed to pick a 'fast' tokenizer by default. My best guess is that it is having some trouble with this. I also tried downgrading transformers and reinstalling, with no success. df is just one column of text.
from transformers import AutoTokenizer
import pandas as pd  # needed for pd.read_sql below
import tokenizers

def batch_iterator(batch_size=10, size=5000):
    # cmx_uat is the database connection (not shown here)
    for i in range(100):  # 2264
        query = f"select note_text from cmx_uat.note where id > {i * size} limit 50;"
        df = pd.read_sql(sql=query, con=cmx_uat)
        for x in range(0, size, batch_size):
            yield list(df['note_text'].loc[0:5000])[x:x + batch_size]

old_tokenizer = AutoTokenizer.from_pretrained('roberta')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
Comments (1)
There are two things to keep in mind:
First: train_new_from_iterator works with fast tokenizers only (here you can read more).
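A minimal sketch of checking this first point before training, assuming the transformers library is installed. The helper names ensure_fast and load_fast_tokenizer are hypothetical and not part of the library; is_fast and use_fast are the library's own attribute and argument.

```python
def ensure_fast(tokenizer):
    """Return the tokenizer unchanged if it is a fast one, else raise a clear error."""
    # Slow (pure-Python) tokenizers report is_fast == False.
    if not getattr(tokenizer, "is_fast", False):
        raise TypeError(
            f"{type(tokenizer).__name__} is a slow tokenizer; "
            "train_new_from_iterator is only available on fast tokenizers"
        )
    return tokenizer


def load_fast_tokenizer(checkpoint):
    """Load a checkpoint with AutoTokenizer and insist on the fast implementation."""
    # Imported lazily so that ensure_fast itself has no third-party dependency.
    from transformers import AutoTokenizer

    # use_fast=True is already the default; it is spelled out here for clarity.
    return ensure_fast(AutoTokenizer.from_pretrained(checkpoint, use_fast=True))
```

Calling load_fast_tokenizer('roberta-base') (the checkpoint name is an assumption; the question uses 'roberta') should either return a tokenizer that has train_new_from_iterator or fail right away with a readable message, rather than with the AttributeError from the question.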
Second: the training corpus should be a generator of batches of texts, for instance a list of lists of texts if you have everything in memory (official documentation).
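For this second point, a small sketch of a batch generator over texts that are already in memory. The function name batch_iterator echoes the question; the sample texts and the batch size are made up for illustration.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive lists of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        # Each yielded item is a list of strings, i.e. one batch of texts.
        yield texts[start:start + batch_size]


# Hypothetical in-memory corpus standing in for the note_text column.
corpus = ["first note", "second note", "third note", "fourth note", "fifth note"]
batches = list(batch_iterator(corpus, batch_size=2))
```

The generator can then be passed as the first argument, as in the question: old_tokenizer.train_new_from_iterator(batch_iterator(corpus), 32000). Because the range stops at len(texts), no empty batch is ever yielded.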