Masked image and language modeling with VisualBERT

I was writing this piece of code, which relies heavily on the visual question answering demo. I am masking the text input with the [MASK] token before feeding it to BERT, and providing a label that accompanies the mask. Visual embeddings are extracted through an R-CNN, which gives me 36 such vectors, and I take the mean of all 36 as shown below:

# Mean-pool the 36 ROI feature vectors into one (1, 1, 2048) visual embedding.
features = torch.mean(output_dict.get("roi_features"), dim=1).reshape(1, 1, 2048)

This pooled vector is fed to the VisualBERT pretraining model, which gives me prediction_logits. So now, as you can see in the notebook and here too, taking the argmax of the prediction logits gives:

prediction_logits[0].argmax(-1)

>> tensor([1012, 1037, 6302, 1997, 1037, 5723, 1012, 2003])
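
For context, here is a rough sketch of the forward pass that produces prediction_logits, assuming the Hugging Face transformers VisualBertForPreTraining checkpoint from the demo and a bert-base-uncased tokenizer; the R-CNN step that fills output_dict is omitted, and the masked sentence is a hypothetical stand-in for the one in the notebook:

import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Hypothetical masked sentence; the real input/label pair lives in the notebook.
inputs = tokenizer("a photo of a [MASK] .", return_tensors="pt")

# `features` is the mean-pooled (1, 1, 2048) visual embedding computed above.
outputs = model(
    **inputs,
    visual_embeds=features,
    visual_attention_mask=torch.ones(features.shape[:-1], dtype=torch.float),
    visual_token_type_ids=torch.ones(features.shape[:-1], dtype=torch.long),
)
prediction_logits = outputs.prediction_logits  # (1, text_len + 1, vocab_size)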

Now, when I look up the above predicted ids in the tokenizer's vocabulary, this is what comes out:

.
a
photo
of
a
bathroom
.
is
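
For reference, a minimal sketch of this id-to-word lookup, using the tokenizer's convert_ids_to_tokens (the exact code in the notebook may differ):

# Map the argmax ids back to tokens via the tokenizer's vocabulary.
predicted_ids = prediction_logits[0].argmax(-1)
tokens = tokenizer.convert_ids_to_tokens(predicted_ids.tolist())
print(tokens)  # ['.', 'a', 'photo', 'of', 'a', 'bathroom', '.', 'is']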

Instead of bathroom, I should have gotten cat, or at least something close to cat, but bathroom wins by a clear margin: it is voted highest in our output with a score of 9.5069, while cat scores only 6.3830. Can we somehow raise the score of cat and make it the most likely output?
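
To compare the two candidates directly, the raw logits at the masked position can be read off per vocabulary id; a minimal sketch, assuming the mask sits at position 5 (where bathroom appears in the decoded sequence above) and that cat maps to a single token in the vocabulary:

# Compare logits of candidate words at the masked position.
mask_pos = 5  # index of 'bathroom' in the decoded sequence above (assumed mask slot)
logits_at_mask = prediction_logits[0, mask_pos]

bathroom_id = tokenizer.convert_tokens_to_ids("bathroom")
cat_id = tokenizer.convert_tokens_to_ids("cat")
print(logits_at_mask[bathroom_id].item())  # ~9.5069 in the run above
print(logits_at_mask[cat_id].item())       # ~6.3830 in the run above

# How many vocabulary entries outscore 'cat' at this position:
rank_of_cat = int((logits_at_mask > logits_at_mask[cat_id]).sum())
print(rank_of_cat)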
