Finding duplicate video files by database (millions), fingerprint? Pattern recognition?

Published 2024-09-16 06:04:26


In the following scenario:

I have a project with a catalog of currently some ten thousand video files, and the number is going to increase dramatically.

However, lots of them are duplicates. With every video file I have associated semantic and descriptive information, and I want to merge duplicates to achieve better results for each one.

Now I need some sort of procedure where I index metadata in a database, and whenever a new video enters the catalog, the same data is calculated and matched against what is already in the database.

Problem is, the videos aren't exact duplicates. They can have different quality, may be cropped, watermarked, or have a sequel/prequel. Or be cut off at the beginning and/or end.

Unfortunately, the better the comparison, the more CPU- and memory-intensive it gets, so I plan on implementing several layers of comparison that begin with a very lenient but fast comparison (maybe video length with a tolerance of 10%) and end with the final comparison that decides whether it's really a duplicate (that would be a community vote).

So, as I have a community to verify the results, it suffices to deliver "good guesses" with a low miss ratio.
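A minimal sketch of such a layer chain, assuming each layer is a cheap predicate that may only answer "cannot match" / "may match"; the layer function and the `duration` attribute are hypothetical placeholders, not part of any real schema:

```python
# Sketch of the layered filter chain: cheap layers run first and may only
# rule candidates out; survivors of all layers go on to the community vote.
def may_match_length(a, b, tolerance=0.10):
    # Layer 1: video length within a 10% tolerance (hypothetical attribute).
    return abs(a.duration - b.duration) <= tolerance * max(a.duration, b.duration)

LAYERS = [may_match_length]  # append more expensive layers over time

def candidate_duplicates(new_video, catalog):
    for old in catalog:
        if all(layer(new_video, old) for layer in LAYERS):
            yield old  # a "good guess" to forward to the next stage / vote
```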

So now my question is: what layers can you think of, or do you have a better approach?

I don't care about the effort to create the metadata, I have enough slaves to do that. Just the comparison should be fast. So if it helps, I can convert the video 100 times as well...

Here are my current ideas:

  • video length (seconds)

  • first and last frame picture analysis

I would resample the picture to a thumbnail size and get the average RGB value, then serialize pixel by pixel: if the color at a pixel is greater/smaller than the average, it is represented by 1 or 0. So I get a binary string which I can store into MySQL, do a bitwise XOR (supported by MySQL internally) and count the differing bits (also supported internally; that is then the Hamming distance between the binary strings).
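A minimal sketch of this fingerprint, assuming the first and last frames have already been extracted to image files (e.g. with ffmpeg), and using a grayscale thumbnail as a stand-in for the per-pixel RGB average; the 16x16 size and the SQL schema in the comment are illustrative assumptions:

```python
from PIL import Image
import numpy as np

def frame_fingerprint(path: str, size: int = 16) -> str:
    """Resample a frame to a thumbnail and threshold each pixel against the
    mean brightness, yielding a size*size-bit binary string."""
    img = Image.open(path).convert("L")  # grayscale stands in for avg RGB
    thumb = np.asarray(img.resize((size, size)), dtype=np.float32)
    bits = (thumb > thumb.mean()).astype(np.uint8)
    return "".join(map(str, bits.flatten()))

def hamming(a: str, b: str) -> int:
    # XOR-and-popcount of the bit strings, i.e. the Hamming distance.
    return sum(x != y for x, y in zip(a, b))

# Inside MySQL the same comparison can run on integer columns, e.g. with a
# hypothetical schema that splits the 256 bits into BIGINT chunks:
#   SELECT id FROM videos
#   WHERE BIT_COUNT(fp0 ^ :probe0) + BIT_COUNT(fp1 ^ :probe1) < :threshold
```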

  • development of the bitrate over time with the same VBR codec

I would transcode the video into a VBR video file with the exact same settings.
Then I would look at the bitrate at certain points in time (percentage of the video completed, or absolute seconds; then we would only analyze a portion of the video).
Same thing as with the picture: if the bitrate is greater than the average, it's 1, else 0.
We make a binary string, store it in the database and calculate the Levenshtein distance later.
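A rough sketch of this bitrate profile, assuming the file has already been transcoded with identical VBR settings; ffprobe (part of FFmpeg) reports per-packet sizes, which are summed into one-second buckets here (the bucket width and the mean threshold are arbitrary choices):

```python
import subprocess
from collections import defaultdict

def bitrate_fingerprint(path: str, bucket_seconds: float = 1.0) -> str:
    # Ask ffprobe for (pts_time, size) of every video packet, as CSV.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "packet=pts_time,size", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True).stdout
    buckets = defaultdict(int)
    for line in out.strip().splitlines():
        pts_time, size = line.split(",")[:2]
        if pts_time in ("", "N/A"):
            continue  # a few packets may lack timestamps
        buckets[int(float(pts_time) / bucket_seconds)] += int(size)
    series = [buckets[i] for i in range(max(buckets) + 1)]
    mean = sum(series) / len(series)
    # 1 where the per-bucket byte count is above the mean, 0 otherwise.
    return "".join("1" if v > mean else "0" for v in series)
```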

  • audio analysis (bitrate and decibel variation over time, just like the bitrate of the video)

  • keyframe analysis

Image comparison just like the first and last frame, but at keyframe positions? We would use the same source files we used for the bitrate calculations, because keyframes are heavily dependent on the codec and settings.

  • development of color over time

Maybe let's take one or more areas/pixels inside the image and see how they develop over time, as well as the change above/below the average.
Black/white would suffice, I think.

  • present the suggestions to the user for final approval...

Or am I going the completely wrong way? I think I can't be the first one to have this problem, but I have not had any luck finding solutions.


Comments (3)

感情废物 2024-09-23 06:04:26


This is a huge problem, so I've chosen to write a rather lengthy reply to try to decompose the problem into parts that may be easier to solve.

It is important that the comparisons be performed using the compute and time resources available: I doubt a solution that takes months to run will be very useful in a dynamic video database. And the size of the database likely makes the use of cloud computing resources unfeasible. So we really care about the local cost of each comparison in several different domains: 1) Data storage, 2) compute resources, and 3) time.

One key cost to consider is that of extracting the data needed from each video for whatever comparison metrics are to be used. Once the extracted data is available, then the cost of performing a comparison must be considered. Finally, the comparisons needed to match all videos to each other must be performed.

The cost of the first two steps is O(1) in the number of videos (constant work per video). The cost of the last step must be worse than O(1): up to O(N) comparisons per new video, O(N²) overall. So our primary goal should be minimizing the costs of the last step, even if it means adding many early, simple steps.

The optimal algorithms for this process will greatly depend on the characteristics of the database, the level to which single and multiple matches exist. If 100% of the videos match some other video, then we will want to minimize the cost of a successful match. However, the more likely case is that matches will be rare, so we will want to minimize the cost of an unsuccessful match. That is to say, if there is a quick and dirty way to say "these two videos can't be matches", then we should use it first, before we even start to confirm a match.

To characterize the database, first do some sampling and hand-matching to estimate the degree of matching within the database. This experiment should show how the redundant videos "clumped": If a given video had a match, how likely was it to have more than a single match? What percentage of all matches were also part of a multiple match? This process will yield a 'model' of the database (a statistical distribution) that will be used to aid algorithm selection and tune the system.

Going forward I will assume matches are relatively rare. After all, if there are lots of matches, the videos will "clump", effectively making the database smaller, and thus making the problem simpler. Let's assume the problem stays as hard as possible.

I'd advocate a multi-level categorization approach, where we'd build a sequence of algorithms that repeatedly perform the binary decision of "these two videos do not match" / "these two videos may possibly match". Only the very last algorithm in the chain needs to output the answer "These two videos match."

Classification/matching algorithms can fail in either or both of two ways: False Positive (non-matching videos are mislabeled as matching) and False Negative (matching videos are mislabeled as non-matching). Each of these wrong decisions has a range of probabilities associated with it, and we want to minimize both.

Since we are building an algorithm pipeline, we want algorithms that are very good at identifying non-matches without error, meaning they must have an extremely low False Reject rate, and we don't much care about the False Accept rate. For example, Weird Al's clone of a video may look and sound very much like the original, and we may not be able to show it is not a match to the original until later in the algorithm pipeline.

The simplest, fastest, most reliable algorithms should be run first, since the overwhelmingly vast majority of tests will yield the "do not match" result. The simplest check would be to search for identical files within the database, something done by many fast and simple filesystem and database maintenance utilities. After this scan is run, we can assume we will actually need to open and read the video files to detect differences.

Since video comparison is relatively tough, let's start with the audio. Think of the database as first being an MP3 collection that may contain duplicates. After all, if we get a good audio match, it is very likely we will have a video match, and vice-versa. We can safely say the audio is a 'fair' representative for the video. Fortunately, a quick web search will yield many audio fingerprinting and comparison packages that are reliable, fast and mature. The audio fingerprint would need to be generated for every video in the database. Videos lacking an audio track would automatically fall into the "could match" set.
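As one concrete example of such a package, here is a sketch using Chromaprint's fpcalc command-line tool (assumed installed and on PATH); the XOR-popcount similarity below is a toy illustration, not Chromaprint's official matcher:

```python
import subprocess

def chromaprint_raw(path: str) -> list[int]:
    # fpcalc -raw prints a FINGERPRINT= line of comma-separated 32-bit ints.
    out = subprocess.run(["fpcalc", "-raw", path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("FINGERPRINT="):
            return [int(x) for x in line.split("=", 1)[1].split(",")]
    raise ValueError(f"no fingerprint produced for {path}")

def audio_similarity(fp1: list[int], fp2: list[int]) -> float:
    # Fraction of matching bits across overlapping sub-fingerprints:
    # ~1.0 means near-identical audio, ~0.5 means unrelated.
    n = min(len(fp1), len(fp2))
    diff = sum(bin(a ^ b).count("1") for a, b in zip(fp1[:n], fp2[:n]))
    return 1.0 - diff / (32.0 * n)
```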

But there is a 'gotcha' here: What about voice-overs? If a given video is encoded twice, with and without a voice-over, are they a match or not? What about the French audio vs the Spanish or English? If these should all be considered to be a match, then audio testing may need to be skipped.

At this point, we know the filesystem entries are all "different enough", and we know the audio tracks are all "different enough" (if tested), which means we can't put off looking at the video data any longer. Fortunately, this should need to be done for only a small fraction of the video database, so we can tolerate some cost. As before, we will still want to first try to quickly eliminate more non-matches before we try to positively label a match.

Since we need to take resolution changes into account (e.g., from 1080p to iPod), we will need a way to characterize video information that is not only resolution-independent, but also tolerant of the noise added and/or data lost as part of changing the resolution. We must tolerate frame rate changes (say, from a movie's 24 fps to video's 30 fps). There are also aspect ratio changes to consider, such as from 4:3 NTSC to 16:9 HD. We would want to handle color-space changes, such as from color to monochrome.

Then there are transformations that affect all these at once, such as transcoding between HD and PAL, which can simultaneously affect color-space, frame-rate, aspect ratio, and resolution. The characterization should also be tolerant of some degree of cropping and/or filling, such as would happen from a switch back and forth between 4:3 and 16:9 aspect ratios (letterboxing, but not pan & scan). We also should handle videos that have been truncated, such as removing the credits from the end of a feature movie. And, obviously, we must also handle the differences created by different encoders that were fed an identical video stream.

That's quite a list! Let's consider some things we may choose not to account for: I suspect it is OK to fail to find a match when image warping is present, despite the fact that anamorphic warping isn't uncommon, especially in 35mm wide-screen movies that were directly scanned without anamorphic reconstruction (tall-skinny people). We may also choose to fail when large watermarks are present in the middle of the frame, though we will want to tolerate smaller watermarks in the corners. And finally, it is OK to fail to match videos that have been temporally distorted or spatially flipped, such as when one is a slo-mo of the other, or has been flipped left-to-right.

Does that just about cover the video space? Hopefully it is clear why it is important to start with the filesystem and the audio! That is, first think of your database more like an MP3 collection before considering it as a video collection.

Ignoring the audio, video is just an ordered sequence of still images. So we're actually looking for one or more image comparison algorithms combined with one or more time-series comparison algorithms. This could be either pairs of separate algorithms (characterize each frame, then characterize the sequence of frames), or it could be merged into a single algorithm (look at the differences between frames).

The images themselves may be decomposed further, into a monochrome 'structural' image and a color 'overlay'. I believe we can safely ignore the color information, if it is computationally convenient to do so.

From the above, it may sound like I've assumed we'll have to fully decode a video in order to perform any comparisons on it. That is not necessarily the case, though the comparison of encoded data has many difficulties that limit its usefulness. The one significant exception to this is for object-level video encodings such as MP4, where very high-level multi-frame comparisons have been performed. Unfortunately, object comparison between MP4 streams has not seen much research, and I am aware of no packages able to perform this function. But if you find one, use it!

Most other digital video streams use encoding schemes such as MPEG2, Quicktime, or something similar. These schemes all use the concept of key frames and difference frames, though each implements it differently. When different videos are being compared (ones that are not the same size), it is unlikely the key frames and difference frames will match to any useful degree. However, this does not mean it is impossible, and packages exist that attempt to extract useful information from such streams without performing full decoding. If you find one that is fast, it may fall into a "why not try it" category of tests.

One trick I would use: instead of decoding frames completely, decode them only into separate component channels (HSV, HSL, YUV, whatever) and not all the way to the RGB framebuffer (unless that's what's been encoded, of course). From there, I'd next create separate luminance and chrominance (color) frames so comparisons may be performed in related domains. Decoding all the way to an RGB framebuffer may introduce errors that may make finding matches more difficult.
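A sketch of that decode step, assuming FFmpeg is available: the extractplanes filter pulls out only the luma plane, so the pipeline never materializes full RGB frames (the 64x64 size and 4 fps sampling rate are arbitrary assumptions):

```python
import subprocess

def extract_luma_frames(video: str, out_pattern: str = "luma_%06d.png") -> None:
    # Luma plane only, downscaled and sampled at a few frames per second.
    subprocess.run(
        ["ffmpeg", "-v", "error", "-i", video,
         "-vf", "extractplanes=y,scale=64:64,fps=4",
         out_pattern],
        check=True)
```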

Next, I'd discard the color information. Since a monochrome video should match its color original, we simply don't care about color!

How may the resulting sequence of monochrome frames best be compared to another sequence that may appear very different, yet still may possibly be a match? There have been literally decades of research in this area, much of it categorized under "scale-invariant match detection". Unfortunately, very little of this research has been directly applied to determining when videos do or do not match.

For our purposes, we can approach this issue from several directions. First, we must know for ourselves what is and is not a match in the monochrome domain. For example, we do not care about pixel-level differences, since even if two matching-but-different videos had the same resolution, we must tolerate some level of noise due to things like encoder differences.

A simple (but slow) way forward is to transform each image into a form that is independent of both resolution and aspect ratio. One such transformation is into the spatial frequency domain, and the 2D FFT is ideal for this. After discarding the imaginary component, the real component may be truncated at high frequencies to remove noise and at low frequencies to remove aspect ratio effects, then normalized to a standard scale to eliminate resolution differences. The resulting data looks like an odd tiny image that may be directly compared across video streams.
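A minimal sketch of that transform, assuming fixed-size luma frames such as those produced by the earlier extraction step; taking the magnitude serves here as the practical form of discarding the imaginary component, and the band limits are illustrative assumptions:

```python
import numpy as np

def fft_signature(luma: np.ndarray, keep: int = 16, skip: int = 2) -> np.ndarray:
    """2D FFT of a luma frame, keeping a mid-frequency magnitude band."""
    spectrum = np.fft.fftshift(np.fft.fft2(luma))
    mag = np.abs(spectrum)  # magnitude only; phase is discarded
    cy, cx = mag.shape[0] // 2, mag.shape[1] // 2
    band = mag[cy - keep:cy + keep, cx - keep:cx + keep].copy()  # drop high freqs
    band[keep - skip:keep + skip, keep - skip:keep + skip] = 0   # drop lowest freqs
    return band / (np.linalg.norm(band) + 1e-12)  # normalize to a standard scale
```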

There are many other possible frame transformation strategies, many vastly more efficient than the FFT, and a literature search should highlight them. Unfortunately, I know of few that have been implemented in software libraries that are as easy to use as the FFT.

Once we have transformed the monochrome frames into a smaller and more useful domain, we still must perform the comparison to another such stream from another video. And that video is almost certain not to be a frame-to-frame match, so a simple comparison will certainly fail. We need a comparison that will take into account differences in the time domain, including added/removed frames and differences in frame rate.

If you look at the sequence of FFT frames, you will notice some very distinct behavior. Scene fades are abrupt and extremely easy to spot, cuts can also be distinguished, and there are typically only slow changes seen in the FFT between cuts. From the sequence of FFTs we can label each frame as being the first after a cut/fade, or as a frame between cuts/fades. What's important is the time between each cut/fade, independent of the number of frames between them, which creates a signature or fingerprint that is largely independent of the frame rate.
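A sketch of deriving that cut fingerprint from a sequence of per-frame signatures; the jump threshold is an assumed constant that would need tuning against real footage:

```python
import numpy as np

def cut_fingerprint(signatures: list[np.ndarray],
                    timestamps: list[float],
                    threshold: float = 0.5) -> list[float]:
    """Return the intervals (in seconds) between detected cuts/fades."""
    cut_times = [timestamps[0]]
    for prev, cur, t in zip(signatures, signatures[1:], timestamps[1:]):
        if np.linalg.norm(cur - prev) > threshold:  # abrupt change => cut/fade
            cut_times.append(t)
    # Intervals between cuts: independent of frame count and frame rate.
    return [b - a for a, b in zip(cut_times, cut_times[1:])]
```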

Taking this fingerprint of an entire video yields data that is massively smaller than the video itself. It is also a linear sequence of numbers, a simple time-series vector, much like audio, and can be analyzed using many of the same tools.

The first tool is to perform a correlation, to determine if the pattern of cuts in one video is a close match to that in another video. If there are significant differences, then the videos are different. If they are a close match, then only a few FFTs after each correlated cut need to be compared to determine whether the frames are similar enough to be a match.
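A sketch of that correlation step: slide the shorter cut-interval sequence along the longer one and take the best normalized score, which tolerates trimmed beginnings and endings; the scoring is a toy illustration rather than a tuned matcher:

```python
import numpy as np

def best_cut_correlation(a: list[float], b: list[float]) -> float:
    """Best normalized cross-correlation of two cut-interval sequences."""
    x, y = np.asarray(a, float), np.asarray(b, float)
    if len(x) < len(y):
        x, y = y, x
    y = (y - y.mean()) / (y.std() + 1e-12)
    best = -1.0
    for off in range(len(x) - len(y) + 1):  # every alignment of y inside x
        win = x[off:off + len(y)]
        win = (win - win.mean()) / (win.std() + 1e-12)
        best = max(best, float(np.dot(win, y)) / len(y))
    return best  # ~1.0 means the cut patterns line up closely
```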

I'll not go into the comparison of 2D FFTs here, since there are abundant references that do the job far better than I can.

Note: There are many other manipulations (beyond a 2D FFT) that may be applied to monochrome frames to obtain additional fingerprints. Representations of actual image content may be created by extracting the interior edges of the image (literally like an FBI fingerprint), or by selectively thresholding the image and performing a 'blobbing' operation (creating a linked list of related region descriptors). Tracking the evolution of the edges and/or blobs between frames can be used not only to generate cut lists, but can also be used to extract additional high-level image characteristics that would be lost using a 2D FFT.

We have constructed a sequence of comparison algorithms that should be very fast at finding non-matches, and not require too much time to conclusively determine matches. Alas, having algorithms does not a solution make! We must consider several issues related to how these algorithms should best be implemented.

First, we don't want to open and read each video file any more times than necessary, else the CPU could stall waiting for data from the disk. We also don't want to read any further into a file than needed, though we don't want to stop reading too soon and potentially miss a later match. Should the information that characterizes each video be saved, or should it be recomputed when needed? Addressing these issues will permit an efficient and effective video comparison system to be developed, tested and deployed.

We have shown it is possible to compare videos with some hope of finding matches under highly variable conditions, with computational efficiency.

The rest has been left as an exercise for the reader. ;^)

琉璃梦幻 2024-09-23 06:04:26


Great question! Only testing will tell which of those factors will be the best indicators. Some ideas:

  • development of the bitrate over time with the same vbr codec: Sounds very CPU-intensive but I imagine it would give great results. Audio analysis seems like it would give similar results with less work.
  • first and last frame picture analysis: Wouldn't 50% of these be black? A better idea might be to use the very middle frame, but I wouldn't count on this technique being reliable.
  • Use Bayesian statistics to record which factors make the best contributions to a positive match. This could be done in the testing phase to weed out unhelpful and expensive comparisons (see the sketch after this list).
  • Get users to help out! Let users group together duplicates they find. They vote on the one with the best quality and that one will act as the primary/official version within the group.
  • Start with the easiest comparisons and add more sophisticated tests when you find the shortcomings of the simple ones. Video length would be a good one to start with, then perhaps some rudimentary audio analysis, and work your way up from there.
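
A toy sketch of the Bayesian weighting idea from the list above; the per-layer likelihoods are made-up numbers standing in for what a hand-labeled testing phase would actually estimate:

```python
# P(layer fires | duplicate) and P(layer fires | distinct): hypothetical
# values that would come from the labeled testing phase.
LAYERS = {
    "length_within_10pct": (0.95, 0.30),
    "audio_fingerprint":   (0.90, 0.02),
    "cut_pattern":         (0.85, 0.05),
}

def posterior_duplicate(verdicts: dict[str, bool], prior: float = 0.1) -> float:
    """Naive-Bayes combination of per-layer verdicts into P(duplicate)."""
    p_dup, p_dis = prior, 1.0 - prior
    for name, fired in verdicts.items():
        p_fire_dup, p_fire_dis = LAYERS[name]
        p_dup *= p_fire_dup if fired else (1.0 - p_fire_dup)
        p_dis *= p_fire_dis if fired else (1.0 - p_fire_dis)
    return p_dup / (p_dup + p_dis)

# Example: two cheap layers agree, the cut-pattern layer does not.
# posterior_duplicate({"length_within_10pct": True,
#                      "audio_fingerprint": True,
#                      "cut_pattern": False})
```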
苍白女子 2024-09-23 06:04:26


Just try this product - Duplicate Video Search (formerly Visual Search Pony), which can find duplicate video files across various bitrates, formats, resolutions, etc.

For example, star-wars.avi (640x480 H.264) and sw.mpg (1280x720 MPEG) would be detected as duplicates, in case both of them are copies of a great movie - Star Wars.

As per their website, the product uses some video fingerprinting techniques, like key-frame extraction or something similar, to be independent of video encoding, resolution, quality, bitrate, etc.
