Why does DETR need to set an empty class?

Published on 2025-01-18 09:36:34


Why does DETR need to set an empty class?
It has set a "background" class, which means non-object. Why?


Comments (2)

挽清梦 2025-01-25 09:36:34


TL;DR

By default, DETR always predicts 100 bounding boxes. The empty class is used as a condition to filter out meaningless bounding boxes.

Full explanation

If you look at the source code, the transformer decoder transforms each query from self.query_embed.weight into the output hs:

hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

Then a linear layer self.class_embed maps hs into the object class outputs_class. Another linear layer self.bbox_embed maps the same hs into bounding box outputs_coord:

outputs_class = self.class_embed(hs)
outputs_coord = self.bbox_embed(hs).sigmoid()
out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
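To make the shapes concrete, here is a minimal, self-contained sketch of the two prediction heads described above. The sizes are illustrative (hidden_dim=256, num_classes=91 are DETR's COCO defaults), and the real bbox_embed is a small MLP rather than a single linear layer; the key point is that class_embed outputs num_classes + 1 logits, the extra one being the "no object" (empty) class:

```python
import torch
import torch.nn as nn

hidden_dim, num_classes, num_queries = 256, 91, 100

class_embed = nn.Linear(hidden_dim, num_classes + 1)  # +1 logit for the empty class
bbox_embed = nn.Linear(hidden_dim, 4)                 # (cx, cy, w, h); simplified, real DETR uses an MLP

hs = torch.randn(1, num_queries, hidden_dim)          # stand-in for the decoder output
outputs_class = class_embed(hs)                       # shape [1, 100, num_classes + 1]
outputs_coord = bbox_embed(hs).sigmoid()              # shape [1, 100, 4], normalized to [0, 1]

print(outputs_class.shape)  # torch.Size([1, 100, 92])
print(outputs_coord.shape)  # torch.Size([1, 100, 4])
```

Every one of the 100 queries always produces a class vector and a box; the extra logit is what lets a query opt out by declaring itself "no object".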

The number of bounding boxes is set to num_queries (by default 100).

detr = DETR(backbone_with_pos_enc, transformer, num_classes=num_classes, num_queries=100)

As you can see now, without the empty class DETR would always predict 100 bounding boxes (it always tries to bound something, 100 times), even when there is only one object in the image.

Now, let us consider the example below. There are only two meaningful objects (two birds), but DETR still predicts 100 bounding boxes. Thankfully, 98 of the boxes, the ones corresponding to the "empty class", are discarded (the green box and the blue box below, plus the remaining 96 boxes not shown in the picture). Only the red box and the yellow box, which have the output class "bird", are meaningful and hence considered as predictions.

[Image: two birds; the red and yellow boxes are classified as "bird", while the green and blue boxes fall into the empty class]
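The discarding step described above can be sketched as a short, hypothetical post-processing routine: take the most likely class per query and keep only the queries whose argmax is not the empty class (which sits at the last logit index). The tensors here are random stand-ins for a real forward pass:

```python
import torch

num_queries, num_classes = 100, 91
pred_logits = torch.randn(num_queries, num_classes + 1)  # stand-in for out['pred_logits'][0]
pred_boxes = torch.rand(num_queries, 4)                  # stand-in for out['pred_boxes'][0]

probs = pred_logits.softmax(-1)
labels = probs.argmax(-1)

keep = labels != num_classes      # the empty class occupies the last index
kept_boxes = pred_boxes[keep]     # only queries classified as a real object survive
kept_labels = labels[keep]

print(kept_boxes.shape[0], "boxes kept out of", num_queries)
```

With two birds in the image, a trained model would ideally leave exactly two boxes after this filter, while the other 98 queries collapse into the empty class.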

That is how DETR makes dynamic object predictions. It can predict any number of objects less than or equal to num_queries, but not more. If you want DETR to predict more than 100 objects, say 500, you can set num_queries to 500 or above.

沙与沫 2025-01-25 09:36:34


I think the cross-attention at the first decoder layer will update the class embeddings of the queries based on the learned positional embeddings.

The cross-attention weights used in DETR are computed as:

(query + query_pos) @ (key + key_pos)^T

Here query is the class embeddings of the queries; at the first layer they are meaningless (initialized as all zeros), but query_pos is learned to represent the rough detection region of each query. After the first layer, the class embeddings are updated mainly based on the similarity between query_pos and key_pos. Therefore, the class embeddings after the first layer focus mainly on the features around the position of the queries.
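The formula quoted above can be sketched numerically with made-up sizes. Setting the content part query to all zeros (as at the first decoder layer) shows that the attention weights then depend only on query_pos and the keys, which is the point the answer makes:

```python
import torch

num_queries, num_keys, dim = 100, 50, 256

query = torch.zeros(num_queries, dim)      # content embeddings, all zeros at layer 1
query_pos = torch.randn(num_queries, dim)  # learned query positional embeddings
key = torch.randn(num_keys, dim)           # stand-in for the encoder memory features
key_pos = torch.randn(num_keys, dim)       # stand-in for the spatial positional encodings

# (query + query_pos) @ (key + key_pos)^T, followed by scaled softmax
attn_logits = (query + query_pos) @ (key + key_pos).T
attn = (attn_logits / dim ** 0.5).softmax(-1)

# With query == 0, the logits reduce to query_pos alone against the keys:
assert torch.allclose(attn_logits, query_pos @ (key + key_pos).T)
print(attn.shape)  # torch.Size([100, 50])
```

This is a simplified single-head view; real DETR uses multi-head attention with separate projection matrices, but the zero-content argument carries over per head.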
