Weird behavior when fine-tuning a Hugging Face BERT model with TensorFlow
I am trying to fine-tune a Hugging Face BERT model using TensorFlow (on Colab Pro with GPU enabled) for tweet sentiment analysis. I followed the guide on the Hugging Face website step by step, but I am seeing strange training times. This happens with every BERT model I have tried.
I have two datasets of different sizes (10k and 2.5 million tweets) that I need to classify as having either positive or negative sentiment.
With this piece of code I perform tokenization of my dataset:
# perform tokenization of the dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def tokenize_function(sentence):
    return tokenizer(sentence['Phrase'], padding=True, truncation=True, max_length=30)
train = train.map(tokenize_function, batched=True)
test = test.map(tokenize_function, batched=True)
val = val.map(tokenize_function, batched=True)
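For reference, this is roughly what tokenize_function produces for a single example. This is a minimal sketch with bert-base-uncased as a stand-in for MODEL_NAME; my actual code uses padding=True (pad to the longest sequence in each mapped batch), while the sketch pads to max_length so the fixed length is visible:
from transformers import AutoTokenizer

# bert-base-uncased is only a stand-in for MODEL_NAME in this sketch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("great day for a run!", padding="max_length",
                    truncation=True, max_length=30)
print(list(encoded.keys()))       # ['input_ids', 'token_type_ids', 'attention_mask']
print(len(encoded["input_ids"]))  # 30 after padding/truncation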
I then create the TensorFlow datasets:
# go from 'Dataset' type to tensorflow so that our dataset can be used for training in keras
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(return_tensors="tf")
tf_train_dataset = train.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["Label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=256,
)
tf_val_dataset = val.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["Label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=256,
)
tf_test_dataset = test.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=256,
)
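For context, a quick sketch of what to_tf_dataset gives back, assuming the setup above: a batched tf.data.Dataset yielding a dict of feature tensors plus a label tensor:
# Sketch: inspect the dataset produced by to_tf_dataset
print(tf_train_dataset.element_spec)     # (dict of feature tensors, labels), one element per 256-sample batch
for features, labels in tf_train_dataset.take(1):
    print(features["input_ids"].shape)   # (batch_size, sequence_length)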
Download and compile the model:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# download pre-trained model
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.metrics.SparseCategoricalAccuracy()],
)
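For reference, the reason for from_logits=True is that the sequence-classification head returns raw logits of shape (batch, num_labels). A minimal sketch, again with bert-base-uncased standing in for MODEL_NAME:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# bert-base-uncased is only a stand-in for MODEL_NAME in this sketch
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
batch = tok(["what a lovely day", "this is awful"], padding=True, return_tensors="tf")
print(clf(batch).logits.shape)  # (2, 2): raw, unnormalized scores, one row per tweet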
And finally I train the model:
# Compute some variables needed to speed up training
batch_size = 64
train_steps_per_epoch = int(len(tf_train_dataset) // batch_size)
dev_steps_per_epoch = int(len(tf_val_dataset) // batch_size)

# train model
model.fit(
    tf_train_dataset,
    validation_data=tf_val_dataset,
    epochs=1,
    verbose=2,
    # steps_per_epoch=train_steps_per_epoch,
    # validation_steps=dev_steps_per_epoch,
)
I first trained this model on the 10k dataset and one epoch takes around 20 minutes. To me this is a lot: the training set is not that big and I am using a rather powerful GPU. I searched the web for tricks to speed up training, and someone on Stack Overflow suggested setting the steps_per_epoch parameter, which I set to what you can see in the code. Now the training time improves dramatically; I can train the model on the full dataset (2.5M) for 3 epochs in 30 minutes, but the performance actually decreases. I looked up the definition of steps_per_epoch, and to me it seems almost the same thing as batch_size.
My questions now would be:
- Is it normal for a BERT model to take 20 minutes for one epoch on a dataset of 10k tweets?
- What does steps_per_epoch actually do? Why does it speed up training so much, and why does the performance actually decrease?
Comments (1)
For your first question, I haven't used BERT models before, so I can't say.
For your second question, from what I understand, steps_per_epoch is the number of sample batches that will be used to fit the model during one epoch.
So let's say your number of epochs was 2 instead of 1 and you set steps_per_epoch to 10. Once the model has been fitted with 10 batches of your training data (in your case the batch size is set to 64, which is 640 samples per epoch), it will end that epoch and start the second epoch, which will also end after receiving 640 samples. So in the end the model is trained on only about 1,280 samples of the data.
Although in your case you set steps_per_epoch to the length of your entire sample size, so it is strange that it reduced the training time. I will try something similar to understand what is going on.
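To illustrate what I mean, here is a toy sketch with synthetic data and made-up numbers (not your tweet dataset): Keras pulls only steps_per_epoch batches from the dataset in each epoch, so a small value makes epochs finish quickly but cover less data.
import numpy as np
import tensorflow as tf

# Toy data: 640 samples in 10 batches of 64 (all numbers are illustrative only)
x = np.random.rand(640, 4).astype("float32")
y = np.random.randint(0, 2, size=(640,))
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

toy = tf.keras.Sequential([tf.keras.layers.Dense(2)])
toy.compile(optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# With steps_per_epoch=2, each "epoch" sees only 2 * 64 = 128 samples;
# .repeat() keeps the dataset from running out between epochs.
toy.fit(ds.repeat(), epochs=2, steps_per_epoch=2, verbose=2)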