What makes object representation and recognition hard?

Posted on 2024-10-18 09:48:54

Intuitively, it would seem that given a dozen or so 2D images of almost any object, taken from different angles, it should be easy to construct a 3D representation of that object. A library of 3D representations obtained in this way could then be used to identify objects in new 2D images.

What literature is there along these lines, and why has it not yet produced strong object recognition?

Comments (3)

因为看清所以看轻 2024-10-25 09:48:54

It is your word "intuitively" that is causing you trouble. Your brain is not designed to be very good at certain tasks, like multiplying thousands of numbers in an instant. For raw computational power, however, your brain makes the fastest computer look like mere tiddlywinks: neural response time is only about 10 milliseconds, but with roughly 10^11 neurons and on the order of 10^14 synapses all working in parallel, it totally beats any modern machine. It's just that your brain is designed to solve problems that are far more computationally complex, like recognizing objects in a picture, parsing sound data, picking out individual speakers amidst background noise, and learning to classify and deal with tens of thousands of types of objects.

The incredibly computationally intense things your brain is designed to do really well are the things that, to a person, seem "intuitive". The things it isn't designed to do well seem "unintuitive" or difficult. But the raw computation needed for strong object recognition is WAY more than what is needed to accomplish the things one considers only possible for a computer, because there are just so MANY kinds of objects, many of which have subobjects, multiple classifications, and non-rigid forms (e.g. "trousers", "water", "dog"). Things like using "common sense" to solve an everyday problem are similarly trivial for a person, but computationally incredibly complex.

伪心 2024-10-25 09:48:54

What you want to do is indeed possible, but there are quite a few buts.

For the 3D reconstruction:

  • For anything but the simplest shapes you need more than just a few dozen images.
  • The shape you are reconstructing needs to have a lot of recognizable features that look similar enough from different angles so that you can match them.
  • Lighting needs to be fairly constant over your entire set of images, otherwise shadows will throw you off (or you need even more images).
  • Even with very feature-rich objects (i.e. a lot of variation in colour and shape), the 3D reconstruction accuracy from any matched pair of features is going to be terrible if you do not have full knowledge of the parameters (position, view direction and opening angle) of the camera used to take each picture.
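
To make the last point concrete, here is a minimal sketch of triangulating one matched feature from two cameras, simplified to a top-down 2D view. Everything in it is an assumption for illustration: the function name `triangulate`, the camera positions, the target point, and the one-degree error in the second camera's view direction are all hypothetical.

```python
import math

def triangulate(c1, theta1, c2, theta2):
    """Intersect two rays in the plane; each ray starts at c and points along angle theta."""
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # Solve c1 + t*d1 = c2 + s*d2 for t using the 2D cross product.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel")
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (c1[0] + t * d1[0], c1[1] + t * d1[1])

# Hypothetical setup: a feature at (0, 10) seen by two cameras 2 units apart.
target = (0.0, 10.0)
cam1, cam2 = (-1.0, 0.0), (1.0, 0.0)
bearing1 = math.atan2(target[1] - cam1[1], target[0] - cam1[0])
bearing2 = math.atan2(target[1] - cam2[1], target[0] - cam2[0])

# With exact camera parameters the rays meet at the target.
exact = triangulate(cam1, bearing1, cam2, bearing2)
# Same observation, but camera 2's view direction is misjudged by one degree.
skewed = triangulate(cam1, bearing1, cam2, bearing2 + math.radians(1.0))

error_exact = math.dist(exact, target)
error_skewed = math.dist(skewed, target)
print(f"error with known cameras:    {error_exact:.6f}")
print(f"error with 1 degree of skew: {error_skewed:.3f}")
```

When the baseline between the cameras is short compared with the distance to the object, even a small angular error moves the intersection a long way along the viewing rays.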

These are all problems that can be solved, so suppose you have solved them, and now you have a new picture of the object that you want to match to your 3D shape.

You could of course try to find a 2D projection of your shape that fits the new picture, but the search space there is enormous. It would probably be a lot easier and faster to use the feature finding and matching system you built for the initial 3D reconstruction to directly match the new picture to the existing set, and find where it fits on the object that way.

So once you've solved the problem of creating the initial 3D reconstruction, your second step is basically done as well.

Photosynth is a brilliant example of these two steps. Browse the site, try to find some of the references they have there.

As for your final step, strong object recognition, just imagine the search space! What you need for strong object recognition, apart from a good representation of the objects you want to recognize, is a good way to search the space of objects you know, and a good way to represent your new object (the image of an object in this case) in that space. This is something I know nearly nothing about.

For just matching the same object in different 2D images there are SIFT features. But I don't think this translates well to 3D.
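
As a rough illustration of what matching features between two 2D images involves, the sketch below pairs descriptors by nearest neighbour and discards ambiguous pairs with a ratio test. This is not a SIFT implementation: the 4-D descriptors are made-up toy values (real SIFT descriptors are 128-D), and the function name `match_descriptors` and the 0.75 threshold are assumptions chosen for the example.

```python
import math

def match_descriptors(query, reference, ratio=0.75):
    """Return (query_index, reference_index) pairs that pass the ratio test."""
    matches = []
    for qi, q in enumerate(query):
        # Distance from this query descriptor to every reference descriptor.
        dists = sorted((math.dist(q, r), ri) for ri, r in enumerate(reference))
        if len(dists) < 2:
            continue
        (best, best_idx), (second, _) = dists[0], dists[1]
        # Accept only if the best match is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((qi, best_idx))
    return matches

# Toy 4-D descriptors (hypothetical values, not real SIFT output).
image_a = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.5, 0.5, 0.5, 0.5)]
image_b = [(0.0, 0.9, 0.1, 0.0), (0.9, 0.1, 0.0, 0.0),
           (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0)]

print(match_descriptors(image_a, image_b))
```

The third descriptor in `image_a` is equally close to two candidates in `image_b`, so the ratio test drops it rather than guessing.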

凉栀 2024-10-25 09:48:54

Note that what you're describing is instance recognition. Computers can indeed do a good job of instance recognition these days. For example, Google Goggles is very good at recognizing landmarks like the Golden Gate Bridge and the Eiffel Tower.

However, computers are less good at doing category recognition and classification. Creating dozens of 2D snapshots for all possible objects under all types of lighting conditions etc. becomes intractable very quickly. The fact that certain objects such as a dog can move around makes the space of possibilities even bigger. Computers become much worse at this.

Also, from the biological standpoint, our visual field is around 100 million pixels. Graphics cards have only now started to become capable of rendering that much data in real-time. Making sense of that much data is even more computationally intensive.

One often talks about having a machine reach a 5-year-old's ability to process information. But let's think about how much data that is. 100 million pixels with 3 color channels and 1 byte per channel is 300 MB per frame. Multiply that by 30 frames per second, 31,556,926 seconds per year, and 5 years, and you end up with roughly 1.4 exabytes (1.4x10^18 bytes).
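
As a quick check of that arithmetic, the sketch below multiplies out the figures quoted in this answer. The one byte per color channel is an assumption, and the constant names are illustrative only.

```python
# Back-of-the-envelope estimate of raw visual input over five years.
PIXELS = 100_000_000           # ~100 million pixels, as quoted in the answer
CHANNELS = 3                   # color channels
BYTES_PER_CHANNEL = 1          # assumption: 8 bits per channel
FPS = 30                       # frames per second, as quoted in the answer
SECONDS_PER_YEAR = 31_556_926
YEARS = 5

bytes_per_frame = PIXELS * CHANNELS * BYTES_PER_CHANNEL
bytes_per_second = bytes_per_frame * FPS
total_bytes = bytes_per_second * SECONDS_PER_YEAR * YEARS

print(f"per frame:  {bytes_per_frame / 1e6:.0f} MB")
print(f"per second: {bytes_per_second / 1e9:.0f} GB/s")
print(f"five years: {total_bytes / 1e18:.2f} EB")
```

So each frame is about 300 MB, the stream is about 9 GB per second, and five years of it comes to a little over 1.4 exabytes.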
