Masked image and language modeling with VisualBERT

I was writing this piece of code, which relies heavily on the visual question answering demo. I am masking the text input with the [MASK] token before feeding it to BERT, and providing a label that accompanies the mask. Visual embeddings are extracted through an R-CNN, which gives me 36 such vectors, and I take the mean of all 36 as shown below:

# Mean-pool the 36 ROI feature vectors into one (1, 1, 2048) visual embedding.
features = torch.mean(output_dict.get("roi_features"), dim=1).reshape(1, 1, 2048)

This pooled vector is fed to the VisualBERT pretraining model, which gives me prediction_logits. So now, as you can see in the notebook and here too, taking the argmax of the prediction logits gives:

prediction_logits[0].argmax(-1)

>> tensor([1012, 1037, 6302, 1997, 1037, 5723, 1012, 2003])
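
For context, here is a rough sketch of the forward pass that produces prediction_logits, assuming the Hugging Face transformers VisualBertForPreTraining checkpoint from the demo and a bert-base-uncased tokenizer; the R-CNN step that fills output_dict is omitted, and the masked sentence is a hypothetical stand-in for the one in the notebook:

import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Hypothetical masked sentence; the real input/label pair lives in the notebook.
inputs = tokenizer("a photo of a [MASK] .", return_tensors="pt")

# `features` is the mean-pooled (1, 1, 2048) visual embedding computed above.
outputs = model(
    **inputs,
    visual_embeds=features,
    visual_attention_mask=torch.ones(features.shape[:-1], dtype=torch.float),
    visual_token_type_ids=torch.ones(features.shape[:-1], dtype=torch.long),
)
prediction_logits = outputs.prediction_logits  # (1, text_len + 1, vocab_size)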

Now, when I look up the above predicted ids in the tokenizer's vocabulary, this is what comes out:

.
a
photo
of
a
bathroom
.
is
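
For reference, a minimal sketch of this id-to-word lookup, using the tokenizer's convert_ids_to_tokens (the exact code in the notebook may differ):

# Map the argmax ids back to tokens via the tokenizer's vocabulary.
predicted_ids = prediction_logits[0].argmax(-1)
tokens = tokenizer.convert_ids_to_tokens(predicted_ids.tolist())
print(tokens)  # ['.', 'a', 'photo', 'of', 'a', 'bathroom', '.', 'is']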

Instead of bathroom, I should have gotten cat, or at least something close to cat, but bathroom wins by a clear margin: it is voted highest in our output with a score of 9.5069, while cat scores only 6.3830. Can we somehow raise the score of cat and make it the most likely output?
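
To compare the two candidates directly, the raw logits at the masked position can be read off per vocabulary id; a minimal sketch, assuming the mask sits at position 5 (where bathroom appears in the decoded sequence above) and that cat maps to a single token in the vocabulary:

# Compare logits of candidate words at the masked position.
mask_pos = 5  # index of 'bathroom' in the decoded sequence above (assumed mask slot)
logits_at_mask = prediction_logits[0, mask_pos]

bathroom_id = tokenizer.convert_tokens_to_ids("bathroom")
cat_id = tokenizer.convert_tokens_to_ids("cat")
print(logits_at_mask[bathroom_id].item())  # ~9.5069 in the run above
print(logits_at_mask[cat_id].item())       # ~6.3830 in the run above

# How many vocabulary entries outscore 'cat' at this position:
rank_of_cat = int((logits_at_mask > logits_at_mask[cat_id]).sum())
print(rank_of_cat)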
