File comparison strategies

Posted 2024-08-26 17:04:02

I'm searching for strategies one might use to programmatically find files which may be duplicates of each other. Specifically in this case, videos.

I'm not looking for exact matches (as nice as that would be in the land of rainbows and sunshine). I'm just looking to collect pairs of videos whose content might be the same, so that a human can compare them to confirm. For example: same content, different resolution.

The strategies I have so far:

  • Hashing
  • Comparing file size
  • Comparing length of video
  • Comparing file names
  • Holding findings persistently to "remember" previous duplicates
  • Mixing and matching strategies above

Are there any other strategies, or refinements of the strategies listed above, that you are aware of?

Does anyone know of any hash functions that produce ranges of hash values to indicate that the overall content is "close"?

Comments (2)

暗恋未遂 2024-09-02 17:04:02

For efficient n-way comparison you'll need to reduce each video to a small parameter space (a "fingerprint") with a similarity metric that correlates well with video similarity. Hashing, for instance, isn't a good parameter space, because small differences in the input videos lead to large differences in the hashes. At the opposite end of the spectrum, video length isn't a good parameter either, because quite different videos can have the same length.
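
To illustrate the point about hashing, here is a minimal sketch that flips a single bit in a stand-in "file" and compares the SHA-256 digests. The byte strings are made-up test data rather than real video, and `hamming_bits` is a hypothetical helper introduced only for this example.

```python
import hashlib


def hamming_bits(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two stand-in "files" (assumed data, not real video) that differ in one bit.
original = bytes(range(256)) * 64
altered = bytearray(original)
altered[1000] ^= 0x01
altered = bytes(altered)

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(altered).digest()

input_diff = hamming_bits(original, altered)
digest_diff = hamming_bits(digest_a, digest_b)

print(f"input bits changed:  {input_diff}")
print(f"digest bits changed: {digest_diff} of 256")
```

A cryptographic hash is designed so that roughly half of the digest bits change for any change in the input, so the distance between two digests says nothing about how close the inputs were. Such a hash can only confirm byte-for-byte identical files.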

A good parameter space depends on what kind of differences you want to ignore, and what kind you want to amplify. One option that might work would be to divide the video into 10-second intervals in the time dimension and into 16 rectangles in the space dimension. Then take the average color of each rectangle over each 10-second interval, and use the Euclidean distance between the parameter vectors as the similarity metric (i.e. for each time interval, for each rectangle, for each color channel, subtract the two intensities, square the difference, and add everything together; the Euclidean distance is the square root of that sum).

If you need to detect clips that might be small parts of other clips it gets a bit trickier, but the general principle of calculating feature vectors should still work. For instance, scene change detection should help in creating parameters that are invariant to video length.
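
The grid-averaging scheme described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: frames are assumed to be already decoded into nested lists of `(r, g, b)` tuples, each frame is assumed to be at least `grid` pixels wide and high, and the names `fingerprint`, `distance`, and `half_and_half` are hypothetical.

```python
import math


def fingerprint(frames, fps, interval_s=10, grid=4):
    """Average RGB per grid cell per time interval, as one flat vector.

    frames: list of frames; a frame is a list of rows; a row is a list
    of (r, g, b) tuples. Decoding is assumed to happen elsewhere.
    """
    per_interval = max(1, int(round(fps * interval_s)))
    vector = []
    for start in range(0, len(frames), per_interval):
        sums = [[0.0, 0.0, 0.0] for _ in range(grid * grid)]
        counts = [0] * (grid * grid)
        for frame in frames[start:start + per_interval]:
            height, width = len(frame), len(frame[0])
            for y, row in enumerate(frame):
                cell_y = y * grid // height
                for x, pixel in enumerate(row):
                    # Cells are proportional to the frame size, so the
                    # result does not depend on the resolution.
                    cell = cell_y * grid + x * grid // width
                    counts[cell] += 1
                    for c in range(3):
                        sums[cell][c] += pixel[c]
        for cell in range(grid * grid):
            n = counts[cell] or 1  # avoid division by zero for empty cells
            vector.extend(s / n for s in sums[cell])
    return vector


def distance(a, b):
    """Euclidean distance between two fingerprints of equal length."""
    if len(a) != len(b):
        raise ValueError("fingerprints cover a different number of intervals")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def half_and_half(width, height, left, right, n_frames):
    """Synthetic test clip: left half one color, right half another."""
    frame = [[left if x < width // 2 else right for x in range(width)]
             for _ in range(height)]
    return [frame] * n_frames


red, blue, green = (200, 0, 0), (0, 0, 200), (0, 200, 0)
low_res = fingerprint(half_and_half(8, 8, red, blue, 20), fps=1)
high_res = fingerprint(half_and_half(16, 16, red, blue, 20), fps=1)
other = fingerprint(half_and_half(8, 8, green, blue, 20), fps=1)

print(distance(low_res, high_res))  # same content, different resolution
print(distance(low_res, other))     # different content
```

With this synthetic data the two resolutions of the same clip produce identical fingerprints, while the clip with different colors ends up far away. On real video the distance for true duplicates will be small rather than zero, so a threshold would have to be chosen empirically. Requiring equal lengths in `distance` is a simplification; clips of different duration need the extra handling the answer alludes to.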

This will be almost impossible for a computer to tell from the raw data. The slightest difference between video streams, such as the width being one pixel less, will result in a completely different data stream. To make any meaningful comparison you will have to re-encode the videos to a known format and resolution, with a very low frame rate. Then you can start looking at each frame to see whether they are similar to each other. This is a very intensive job, both computationally and algorithmically.
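
One way the normalize-then-compare idea might look in practice is sketched below. It assumes the `ffmpeg` command-line tool is installed and on the PATH, and the target of 32x32 grayscale at 1 frame per second, the threshold of 10, and all function names are arbitrary choices made for illustration.

```python
import subprocess

# Assumed normalization target: tiny grayscale frames at a low frame rate.
WIDTH, HEIGHT, FPS = 32, 32, 1


def ffmpeg_command(path):
    """Build an ffmpeg call that writes raw 8-bit gray frames to stdout."""
    return [
        "ffmpeg", "-v", "error", "-i", path,
        "-an",                                        # drop audio
        "-vf", f"fps={FPS},scale={WIDTH}:{HEIGHT}",   # resample and resize
        "-pix_fmt", "gray",
        "-f", "rawvideo", "-",
    ]


def split_frames(raw, frame_size=WIDTH * HEIGHT):
    """Cut a raw byte stream into fixed-size frames, dropping any remainder."""
    usable = len(raw) - len(raw) % frame_size
    return [raw[i:i + frame_size] for i in range(0, usable, frame_size)]


def decode_normalized(path):
    """Run ffmpeg (assumed to be installed) and return normalized frames."""
    result = subprocess.run(ffmpeg_command(path), capture_output=True, check=True)
    return split_frames(result.stdout)


def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel brightness difference between two frames."""
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)


def matching_fraction(frames_a, frames_b, threshold=10.0):
    """Fraction of aligned frame pairs whose difference is within threshold."""
    pairs = list(zip(frames_a, frames_b))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if mean_abs_diff(a, b) <= threshold)
    return hits / len(pairs)
```

A call such as `matching_fraction(decode_normalized("a.mp4"), decode_normalized("b.mp4"))` (the file names are placeholders) would then give a rough score for a human to review. Note that this compares frames position by position, so it assumes both videos start at the same moment; clips that are offset or trimmed would need an alignment step first.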
