Masked image and language modeling with VisualBERT
I was writing a piece of code that borrows heavily from the visual question answering demo. I mask the text input with the [MASK] token before feeding it to VisualBERT, and provide a label for the masked position. Visual embeddings are extracted with an R-CNN, which gives me 36 region feature vectors; I take the mean of all 36 vectors, as shown below:
# mean-pool the 36 RoI feature vectors into a single (1, 1, 2048) visual embedding
features = torch.mean(output_dict.get("roi_features"), dim=1).reshape(1, 1, 2048)
These pooled features are fed to the VisualBERT pre-training model, which gives me prediction_logits. So, as you can see in the notebook (and here too), taking the argmax of the prediction logits yields:
prediction_logits[0].argmax(-1)
>> tensor([1012, 1037, 6302, 1997, 1037, 5723, 1012, 2003])
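
For reference, here is a minimal sketch of roughly how that forward pass can look with the Hugging Face transformers API; the checkpoint name, the prompt, and the placeholder visual features are assumptions, not taken from the notebook:

import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# hypothetical masked prompt; [MASK] marks the position to predict
inputs = tokenizer("a photo of a [MASK].", return_tensors="pt")

# stand-in for the mean-pooled RoI features computed above, shape (1, 1, 2048)
visual_embeds = torch.randn(1, 1, 2048)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

with torch.no_grad():
    outputs = model(
        **inputs,
        visual_embeds=visual_embeds,
        visual_attention_mask=visual_attention_mask,
        visual_token_type_ids=visual_token_type_ids,
    )
prediction_logits = outputs.prediction_logits  # (1, text_len + visual_len, vocab_size)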
Now, when I map the above predictions back to words through the tokenizer's vocabulary, this is the output:
.
a
photo
of
a
bathroom
.
is
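
That mapping can be reproduced with the tokenizer directly; a minimal sketch, assuming the bert-base-uncased tokenizer:

predicted_ids = prediction_logits[0].argmax(-1).tolist()
tokens = tokenizer.convert_ids_to_tokens(predicted_ids)
print(tokens)  # ['.', 'a', 'photo', 'of', 'a', 'bathroom', '.', 'is'] in the notebook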
Instead of bathroom I should have got cat, or at least something close to cat, but there seem to be some ten tokens scoring between bathroom (voted highest in our output, with a score of 9.5069) and cat (with a score of 6.3830). Can we somehow push the score of cat up and make it the most likely output?
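
For reference, the raw logit of any candidate token at the masked position can be read straight out of prediction_logits; a minimal sketch, reusing the tokenizer and logits from the sketch above (the scores in the comments are the notebook's numbers, not what the placeholder features would produce):

# locate the [MASK] position in the text input
mask_pos = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

cat_id = tokenizer.convert_tokens_to_ids("cat")
bathroom_id = tokenizer.convert_tokens_to_ids("bathroom")

print(prediction_logits[0, mask_pos, cat_id].item())       # 6.3830 per the notebook
print(prediction_logits[0, mask_pos, bathroom_id].item())  # 9.5069 per the notebook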