Why does my trained BERT model always predict the most frequent tokens (including [PAD])?
I am trying to further pretrain a Dutch BERT model with MLM on an in-domain (law-related) dataset. I have set up my entire preprocessing and training pipeline, but when I use the trained model to predict a masked word, it always outputs the same words in the same order, including the [PAD] token. This is odd, because I thought it shouldn't be able to predict the pad token at all, since my code makes sure pad tokens are never masked.
See the picture of my model's predictions.
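For reference, the predictions in that picture come from a top-k lookup at the [MASK] position, roughly like the sketch below (the checkpoint paths and the example sentence are placeholders, not my exact code):

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Placeholder paths: a base Dutch checkpoint and my further-pretrained model.
tokenizer = BertTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = BertForMaskedLM.from_pretrained("./bertje-legal-mlm")
model.eval()

text = "De rechter heeft het [MASK] verworpen."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 predicted tokens at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top5))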
I have tried using more data (more than 50,000 instances) and more epochs (about 20). I have gone through my code and am fairly sure it gives the model the right input. The English version of the model seems to work, which makes me wonder whether the Dutch model is less robust.
Does anyone know possible causes of, or solutions for, this? Or is it possible that my language model simply doesn't work?
I will add my training loop and masking function below, in case I overlooked a mistake in either of them:
import torch
from torch import optim
from tqdm import tqdm

def mlm(tensor):
    rand = torch.rand(tensor.shape)
    # Mask ~15% of tokens, but only ids above 3, so the low special-token ids
    # (including [PAD]) should never be selected; 4 is the [MASK] id here.
    mask_arr = (rand < 0.15) * (tensor > 3)
    for i in range(tensor.shape[0]):
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = 4
    return tensor
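For illustration, this is roughly how I apply the function to a batch: the labels are an untouched copy of the input ids, and masking only ever touches ids above 3, so [PAD] positions should stay untouched. (The tokenizer checkpoint is a placeholder, and the check assumes, as the code above does, that the special tokens sit at the low ids with [MASK] at id 4.)

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")  # placeholder checkpoint

batch = tokenizer(["Dit is een juridische voorbeeldzin."], max_length=512,
                  padding="max_length", truncation=True, return_tensors="pt")

labels = batch["input_ids"].clone()          # unmasked copy used as the MLM targets
input_ids = mlm(batch["input_ids"].clone())  # ~15% of ordinary tokens replaced by id 4

# Sanity check: no [PAD] position should end up carrying the [MASK] id.
pad_positions = labels == tokenizer.pad_token_id
assert not (input_ids[pad_positions] == 4).any()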
model.train()
optim = optim.Adam(model.parameters(), lr=0.005)

epochs = 1
losses = []

for epoch in range(epochs):
    epochloss = []
    loop = tqdm(loader, leave=True)
    for batch in loop:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        epochloss.append(loss)
        loss.backward()
        optim.step()
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
    losses.append(epochloss)
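After training I summarise the per-epoch loss and save the model for the prediction step shown earlier; a rough sketch (the output directory is a placeholder):

# Mean loss per epoch; the stored values are tensors, so convert with .item().
for i, epochloss in enumerate(losses):
    print(f"Epoch {i}: mean loss {sum(l.item() for l in epochloss) / len(epochloss):.4f}")

# Placeholder output directory, reloaded later with from_pretrained().
model.save_pretrained("./bertje-legal-mlm")
tokenizer.save_pretrained("./bertje-legal-mlm")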