Why does DETR need to set an empty class?
It has set a "background" class, which means non-object. Why?
Answers (2)
TL;DR
By default, DETR always predicts 100 bounding boxes. The empty class is used as a condition to filter out meaningless bounding boxes.
Full explanation
If you look at the source code, the transformer decoder transforms each query from `self.query_embed.weight` into the output `hs`. Then a linear layer `self.class_embed` maps `hs` into the object class `outputs_class`, and another linear layer `self.bbox_embed` maps the same `hs` into the bounding box `outputs_coord`. The number of bounding boxes is set to `num_queries` (100 by default).

As you can see, without the empty class DETR would always predict 100 bounding boxes (it always tries to bound this and that, 100 times), even when there is only one object in the image.
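The two prediction heads can be sketched roughly as follows (names follow the DETR repository; the dimensions and the use of plain linear layers are simplifying assumptions — in the actual code `bbox_embed` is a small MLP):

```python
import torch
import torch.nn as nn

num_queries, hidden_dim, num_classes = 100, 256, 91  # 91 = COCO classes

# learned per-query embeddings, as in self.query_embed.weight
query_embed = nn.Embedding(num_queries, hidden_dim)

# classification head: +1 extra output for the "no object" (empty) class
class_embed = nn.Linear(hidden_dim, num_classes + 1)
# box head: 4 normalized coordinates (cx, cy, w, h)
bbox_embed = nn.Linear(hidden_dim, 4)

# stand-in for the decoder output hs, batch of 2 images
hs = torch.randn(2, num_queries, hidden_dim)

outputs_class = class_embed(hs)           # shape (2, 100, num_classes + 1)
outputs_coord = bbox_embed(hs).sigmoid()  # shape (2, 100, 4), in [0, 1]
print(outputs_class.shape, outputs_coord.shape)
```

Note that every query slot always produces a class and a box; the "no object" logit is what lets a slot opt out.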
Now, let us consider the example below. There are only two meaningful objects (two birds), but DETR still predicts 100 bounding boxes. Thankfully, the 98 boxes corresponding to the "empty class" are discarded (the green box and the blue box below, plus the remaining 96 boxes not shown in the picture). Only the red box and the yellow box, whose output class is "bird", are meaningful and hence considered as predictions.
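The discarding step can be sketched like this (a simplified post-processing pass; the convention that the last logit index is the no-object class matches DETR, while the 0.7 confidence threshold is an assumption):

```python
import torch

num_queries, num_classes = 100, 91
logits = torch.randn(num_queries, num_classes + 1)  # per-query class logits
boxes = torch.rand(num_queries, 4)                  # per-query boxes

probs = logits.softmax(-1)
# best *real* class per query: drop the last ("no object") column
scores, labels = probs[:, :-1].max(-1)

# keep only queries whose best real class is confident enough;
# queries dominated by the no-object class fall below the threshold
keep = scores > 0.7
kept_boxes, kept_labels = boxes[keep], labels[keep]
print(f"kept {int(keep.sum())} of {num_queries} boxes")
```

With real model weights, most of the 100 slots end up assigned to the no-object class and are filtered out here.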
That is how DETR makes dynamic object predictions. It can predict any number of objects less than or equal to `num_queries`, but not more than that. If you want DETR to predict more than 100 objects, say 500, you can set `num_queries` to 500 or above.

I think the cross-attention at the first decoder layer will update the class embeddings of the queries based on the learned positional embeddings.
The cross-attention weights used in DETR are computed as:

    softmax( (query + query_pos) (key + key_pos)^T / sqrt(d) )

Here `query` is the class embedding of each query; at the first layer these are meaningless (initialized as all zeros), but `query_pos` is learned and represents the rough detection region of each query. After the first layer, the class embeddings are updated mainly based on the similarity between `query_pos` and `key_pos`. Therefore, the class embeddings after the first layer focus mainly on the features around the positions of the queries.
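That attention computation can be sketched as follows (positional embeddings are added to queries and keys only for the similarity, as in the DETR decoder; all dimensions here are illustrative):

```python
import torch
import torch.nn.functional as F

num_queries, num_keys, d = 100, 850, 256

query = torch.zeros(num_queries, d)      # class embeddings: all zeros at layer 1
query_pos = torch.randn(num_queries, d)  # learned query positional embeddings
key = torch.randn(num_keys, d)           # encoder memory features
key_pos = torch.randn(num_keys, d)       # positional embeddings of the memory
value = key                              # values carry content, no positions

# positional embeddings enter only the dot-product similarity
attn = F.softmax(
    (query + query_pos) @ (key + key_pos).T / d ** 0.5, dim=-1
)
# at layer 1, query is all zeros, so the weights reduce to the
# similarity between query_pos and (key + key_pos)
updated = attn @ value  # updated class embeddings, one per query
print(attn.shape, updated.shape)
```

This is why, even with zero-initialized class embeddings, the first layer can already pool features from each query's learned region.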