Fuzzy matching/chunking algorithm
Background: I have video clips and audio tracks that I want to sync with said videos.
From the video clips, I'll extract a reference audio track.
I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.
I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it'd be extremely tedious, so I want to be able to determine these locations programmatically.
Example:
              0         1         2
              012345678901234567890123
    ref:      --part1------part2------
    syn:      -----part1----part2-----
    # (let `-` denote silence)
Output:
    [(2,6), (5,9)        # part1
     (13, 17), (14, 18)] # part2
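As a toy illustration of the output format above, here is a minimal sketch that works on the example strings rather than on real audio. It assumes silence can be detected exactly (the `-` character) and that the sounds appear in the same order in both tracks; `sound_runs` is a made-up helper name, not part of any library.

```python
from itertools import groupby

def sound_runs(track, silence="-"):
    """Return inclusive (start, end) index pairs for each run of non-silent symbols."""
    runs = []
    pos = 0
    for is_silent, group in groupby(track, key=lambda ch: ch == silence):
        length = len(list(group))
        if not is_silent:
            runs.append((pos, pos + length - 1))
        pos += length
    return runs

ref = "--part1------part2------"
syn = "-----part1----part2-----"

# Assumption: the n-th sound in ref corresponds to the n-th sound in syn.
pairs = list(zip(sound_runs(ref), sound_runs(syn)))
print(pairs)  # [((2, 6), (5, 9)), ((13, 17), (14, 18))]
```

With real audio the equality test on characters would have to be replaced by a silence threshold on the samples.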
My idea is, starting from the beginning:
    Fingerprint 2 large chunks* of audio and see if they match:
        If yes: move on to the next chunk
        If not:
            Go down both tracks looking for the first non-silent portion of each
            Offset the target to match the original
            Go back to the beginning of the loop
    # * chunk size determined by heuristics and modifiable
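The loop above can be sketched on the toy strings from the example. This is only an illustration under strong assumptions: plain string equality stands in for the fingerprint comparison, `-` stands in for detected silence, and the function names `first_sound` and `resync` are invented here.

```python
def first_sound(track, start, silence="-"):
    """Index of the first non-silent symbol at or after start (len(track) if none)."""
    for i in range(start, len(track)):
        if track[i] != silence:
            return i
    return len(track)

def resync(ref, syn, chunk=4, silence="-"):
    """Rebuild syn so that its sounds sit where they do in ref."""
    out = []
    r = s = 0
    while r < len(ref):
        ref_chunk, syn_chunk = ref[r:r + chunk], syn[s:s + chunk]
        if ref_chunk == syn_chunk:  # stand-in for the fingerprint comparison
            out.append(syn_chunk)
            r += len(ref_chunk)
            s += len(syn_chunk)
            continue
        next_r = first_sound(ref, r, silence)
        next_s = first_sound(syn, s, silence)
        if next_r == r and next_s == s:
            # Both tracks are already at a sound and still differ: shifting
            # silence cannot fix this, so give up instead of looping forever.
            raise ValueError(f"sounds differ at ref={r}, syn={s}")
        out.append(silence * (next_r - r))  # silence ref has before its next sound
        r, s = next_r, next_s
    return "".join(out)

ref = "--part1------part2------"
syn = "-----part1----part2-----"
print(resync(ref, syn) == ref)  # True
```

Note the `ValueError` branch: a chunk that starts on matching sound but ends differently (for example because silence was inserted mid-chunk) is not handled by this simple loop.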
The main problem here is that sound matching and fingerprinting are fuzzy and relatively expensive operations.
Ideally I want to do them as few times as possible. Ideas?
Comments (2)
Sounds like you're not looking to spend a lot of time delving into audio processing/engineering, and hence you want something that you can quickly understand and that just works. If you're willing to go with something more complex, see here for a very good reference.
That being the case, I'd expect simple loudness and zero crossing measures would be sufficient to identify portions of sound. This is great because you can use techniques similar to rsync.
Choose some number of samples as a chunk size and march through your reference audio data at a regular interval. (Let's call it 'chunk size'.) Calculate the zero-crossing measure for each chunk (you likely want the logarithm, or a fast approximation of it, of a simple zero-crossing count). Store the chunks in a 2D spatial structure based on time and the zero-crossing measure.
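A minimal sketch of that per-chunk measure, assuming the audio is already a plain list of signed sample values and that chunks are taken back to back. Using `math.log1p` (so that a chunk with zero crossings still gets a defined value) is my choice for the sketch, not something the approach requires; the function names are made up.

```python
import math

def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

def chunk_measures(samples, chunk_size):
    """Return (start_index, log-scaled zero-crossing count) for each whole chunk."""
    measures = []
    for start in range(0, len(samples) - chunk_size + 1, chunk_size):
        count = zero_crossings(samples[start:start + chunk_size])
        measures.append((start, math.log1p(count)))  # log1p is an assumption
    return measures

samples = [1, -1, 1, -1, 1, 1, 1, 1]
print(chunk_measures(samples, 4))
```

Each `(start_index, measure)` pair is one point in the time/measure plane that the spatial structure would index.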
Then march through your actual audio data in much finer steps. (The step probably doesn't need to be as small as one sample.) Note that you don't have to recompute the measure for the entire chunk at each step: just subtract the zero-crossings that are no longer in the chunk and add the new ones that are. (You'll still need to compute the logarithm or its approximation.)
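The incremental update might look roughly like this. It is a sketch under the same assumption as before (samples in a plain list), and `rolling_zero_crossings` is a hypothetical name. Only the raw count is maintained here; the logarithm would be applied to each yielded count.

```python
def rolling_zero_crossings(samples, window, step=1):
    """Yield (start, count) for each window, updating the count incrementally."""
    def crosses(i):
        # True when the sign changes between samples i and i + 1
        return (samples[i] < 0) != (samples[i + 1] < 0)

    if len(samples) < window:
        return
    count = sum(crosses(i) for i in range(window - 1))
    yield 0, count
    for start in range(step, len(samples) - window + 1, step):
        for i in range(start - step, start):
            count -= crosses(i)  # sample pairs that slid out on the left
        for i in range(start - step + window - 1, start + window - 1):
            count += crosses(i)  # sample pairs that slid in on the right
        yield start, count

data = [1, -1, -1, 2, 3, -4, 5, -6, -7, 8]
for start, count in rolling_zero_crossings(data, window=4, step=2):
    print(start, count)
```

Each step costs work proportional to the step size rather than to the window size, which is the same trick rsync's rolling checksum relies on.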
Look for the 'next' chunk with a close enough frequency. Note that since what you're looking for is in order from start to finish, there's no reason to look at -all- chunks. In fact, we don't want to, since looking at all of them makes false positives far more likely.
If the chunk matches well enough, see if it matches all the way out to silence.
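To make the in-order search concrete, here is a simplified sketch. It assumes the reference measures cover only non-silent chunks, compares each target window against just the single next unmatched reference chunk, and uses an arbitrary `tolerance`; all names are invented for the illustration. The follow-up check "out to silence" is not included.

```python
def match_in_order(ref_measures, target_measures, tolerance=0.1):
    """Pair target windows with reference chunks, never looking backwards.

    ref_measures:    [(ref_time, measure), ...] in time order
    target_measures: [(target_time, measure), ...] in time order
    """
    matches = []
    next_ref = 0  # index of the next reference chunk still waiting for a match
    for t_time, t_measure in target_measures:
        if next_ref >= len(ref_measures):
            break
        r_time, r_measure = ref_measures[next_ref]
        if abs(r_measure - t_measure) <= tolerance:
            matches.append((r_time, t_time))
            next_ref += 1
    return matches

ref_measures = [(0, 1.0), (4, 2.0)]
target_measures = [(0, 0.0), (1, 1.05), (2, 1.5), (3, 2.0)]
print(match_in_order(ref_measures, target_measures))  # [(0, 1), (4, 3)]
```

The difference between each matched `ref_time` and `target_time` is the offset that would need to be made up with silence.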
The only concerning point is the 2D spatial structure, but honestly this can be made much easier if you're willing to forgo a strict window of approximation. Then you can just have overlapping bins. That way all you need to do is check two bins for all the values after a certain time, which is essentially two binary searches through a search structure.
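One possible reading of the bin idea, as a sketch rather than a definitive design: bucket reference chunks by measure, keep each bucket sorted by time, and for a query look only in the bucket the measure falls into plus its nearer neighbour, using one binary search per bucket. The class name, the bin width, and the nearest-neighbour rule are all assumptions made for the example.

```python
import bisect
import math
from collections import defaultdict

class MeasureBins:
    """Reference chunks bucketed by measure; each bucket is kept sorted by time."""

    def __init__(self, width):
        self.width = width
        self.bins = defaultdict(list)  # bin index -> sorted [(time, measure), ...]

    def add(self, time, measure):
        bisect.insort(self.bins[math.floor(measure / self.width)], (time, measure))

    def first_after(self, measure, after_time):
        """Earliest chunk at or after after_time in the two bins nearest measure."""
        k = math.floor(measure / self.width)
        neighbour = k + 1 if measure - k * self.width >= self.width / 2 else k - 1
        candidates = []
        for index in (k, neighbour):
            entries = self.bins.get(index, [])
            pos = bisect.bisect_left(entries, (after_time, -math.inf))
            if pos < len(entries):
                candidates.append(entries[pos])
        return min(candidates, default=None)

bins = MeasureBins(width=0.5)
for time, measure in [(0, 1.1), (10, 1.2), (5, 1.6), (3, 3.0)]:
    bins.add(time, measure)
print(bins.first_after(1.4, after_time=1))  # (5, 1.6)
```

The window this gives is loose (anything within half a bin width is guaranteed to be seen, some values further away are seen too), which is the price of avoiding a real 2D structure.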
The disadvantage of all this is that it may require some tweaking to get right, and it isn't a proven method.
If you can reliably distinguish silence from non-silence as you suggest, and if the only differences are insertions of silence, then it seems the only non-trivial case is where silence is inserted where there was none before, for example when ref contains part1part2 back to back while syn contains part1-part2.
If you can make your chunk size adaptive to the silence, your algorithm should be fine. That is, if your chunk size is equivalent to two characters in the above example, your algorithm would recognize that "pa" matches "pa" and "rt" matches "rt", but for the third chunk it must recognize the silence in syn and adapt the chunk size to compare "1" to "1" instead of "1p" to "1-". For more complicated edits, you might be able to adapt a weighted Shortest Edit Distance algorithm in which removing silence has zero cost.
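A rough sketch of the weighted edit distance idea, again on toy strings where `-` is silence. The cost scheme (silence insertions and deletions are free, every other insertion, deletion or substitution costs 1) is one plausible weighting chosen for illustration, and `silence_free_distance` is a made-up name.

```python
def silence_free_distance(ref, syn, silence="-"):
    """Edit distance where adding or dropping silence is free and other edits cost 1."""
    def gap(ch):
        return 0 if ch == silence else 1

    # prev[j] = cost of turning an empty ref prefix into the first j symbols of syn
    prev = [0]
    for ch in syn:
        prev.append(prev[-1] + gap(ch))

    for r in ref:
        cur = [prev[0] + gap(r)]
        for j, s in enumerate(syn, start=1):
            cur.append(min(
                prev[j] + gap(r),                    # drop r from ref
                cur[j - 1] + gap(s),                 # drop s from syn
                prev[j - 1] + (0 if r == s else 1),  # keep or substitute
            ))
        prev = cur
    return prev[-1]

print(silence_free_distance("part1part2", "part1-part2"))  # 0
print(silence_free_distance("part1", "part2"))             # 1
```

A distance of 0 means the two tracks differ only in where silence sits; backtracking through the same table would give the positions where silence has to be added or removed.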