Weird behavior when fine-tuning a Hugging Face BERT model with TensorFlow
I am trying to fine-tune a Hugging Face BERT model using TensorFlow (on Colab Pro with GPU enabled) for tweet sentiment analysis. I followed the guide on the Hugging Face website step by step, but I am seeing strange training times. This happens with every BERT model I have tried.
I have two datasets of different sizes (10k and 2.5 million tweets) that I need to classify as having either positive or negative sentiment.
With this piece of code I perform tokenization of my dataset:
# perform tokenization of the dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def tokenize_function(sentence):
    return tokenizer(sentence['Phrase'], padding=True, truncation=True, max_length=30)
train = train.map(tokenize_function, batched=True)
test = test.map(tokenize_function, batched=True)
val = val.map(tokenize_function, batched=True)
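For reference, this is roughly what tokenize_function produces for a single example. This is a minimal sketch with bert-base-uncased as a stand-in for MODEL_NAME; my actual code uses padding=True (pad to the longest sequence in each mapped batch), while the sketch pads to max_length so the fixed length is visible:
from transformers import AutoTokenizer

# bert-base-uncased is only a stand-in for MODEL_NAME in this sketch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("great day for a run!", padding="max_length",
                    truncation=True, max_length=30)
print(list(encoded.keys()))       # ['input_ids', 'token_type_ids', 'attention_mask']
print(len(encoded["input_ids"]))  # 30 after padding/truncation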
I then create the TensorFlow datasets:
# go from 'Dataset' type to tensorflow so that our dataset can be used for training in keras
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(return_tensors="tf")
tf_train_dataset = train.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["Label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=256,
)
tf_val_dataset = val.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["Label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=256,
)
tf_test_dataset = test.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=256,
)
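For context, a quick sketch of what to_tf_dataset gives back, assuming the setup above: a batched tf.data.Dataset yielding a dict of feature tensors plus a label tensor:
# Sketch: inspect the dataset produced by to_tf_dataset
print(tf_train_dataset.element_spec)     # (dict of feature tensors, labels), one element per 256-sample batch
for features, labels in tf_train_dataset.take(1):
    print(features["input_ids"].shape)   # (batch_size, sequence_length)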
Download and compile the model:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# download pre-trained model
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.metrics.SparseCategoricalAccuracy()],
)
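For reference, the reason for from_logits=True is that the sequence-classification head returns raw logits of shape (batch, num_labels). A minimal sketch, again with bert-base-uncased standing in for MODEL_NAME:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# bert-base-uncased is only a stand-in for MODEL_NAME in this sketch
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
batch = tok(["what a lovely day", "this is awful"], padding=True, return_tensors="tf")
print(clf(batch).logits.shape)  # (2, 2): raw, unnormalized scores, one row per tweet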
And finally I train the model:
# Compute some variables needed to speed up training
batch_size = 64
train_steps_per_epoch = int(len(tf_train_dataset) // batch_size)
dev_steps_per_epoch = int(len(tf_val_dataset) // batch_size)

# train model
model.fit(
    tf_train_dataset,
    validation_data=tf_val_dataset,
    epochs=1,
    verbose=2,
    # steps_per_epoch=train_steps_per_epoch,
    # validation_steps=dev_steps_per_epoch,
)
I first trained this model on the 10k dataset and one epoch takes around 20 minutes. To me this is a lot: the training set is not that big and I am using a rather powerful GPU. I searched the web for tricks to speed up training, and someone on Stack Overflow suggested setting the steps_per_epoch parameter, which I set to what you can see in the code. Now the training time improves dramatically; I can train the model on the full dataset (2.5M) for 3 epochs in 30 minutes, but the performance actually decreases. I looked up the definition of steps_per_epoch, and to me it seems almost the same thing as batch_size.
My questions now would be:
- Is it normal for a BERT model to take 20 minutes for one epoch on a dataset of 10k tweets?
- What does steps_per_epoch actually do? Why does it speed up training so much, and why does the performance actually decrease?
Comments (1)
For your first question, I haven't used BERT models before, so I can't say.
For your second question, from what I understand, steps_per_epoch is the number of sample batches that will be used to fit the model during one epoch.
So let's say your number of epochs was 2 instead of 1 and you set steps_per_epoch to 10. Once the model has been fitted with 10 batches of your training data (in your case the batch size is set to 64, which is 640 samples per epoch), it will end that epoch and start the second epoch, which will also end after receiving 640 samples. So in the end the model is trained on only about 1,280 samples of the data.
Although in your case you set steps_per_epoch to the length of your entire sample size, so it is strange that it reduced the training time. I will try something similar to understand what is going on.
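To illustrate what I mean, here is a toy sketch with synthetic data and made-up numbers (not your tweet dataset): Keras pulls only steps_per_epoch batches from the dataset in each epoch, so a small value makes epochs finish quickly but cover less data.
import numpy as np
import tensorflow as tf

# Toy data: 640 samples in 10 batches of 64 (all numbers are illustrative only)
x = np.random.rand(640, 4).astype("float32")
y = np.random.randint(0, 2, size=(640,))
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

toy = tf.keras.Sequential([tf.keras.layers.Dense(2)])
toy.compile(optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# With steps_per_epoch=2, each "epoch" sees only 2 * 64 = 128 samples;
# .repeat() keeps the dataset from running out between epochs.
toy.fit(ds.repeat(), epochs=2, steps_per_epoch=2, verbose=2)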