Using convolution to find a reference audio sample in a continuous stream of sound

Posted 2024-11-04 11:58:15

In my previous question on finding a reference audio sample in a bigger audio sample, it was proposed that I should use convolution.
Using DSPUtil, I was able to do this. I played with it a little and tried different combinations of audio samples to see what the result was. To visualize the data, I just dumped the raw audio as numbers to Excel and created a chart from these numbers. A peak is visible, but I don't really know how this helps me. I have these problems:

  • I don't know how to infer the starting position of the match in the original audio sample from the location of the peak.
  • I don't know how I should apply this to a continuous stream of audio, so that I can react as soon as the reference audio sample occurs.
  • I don't understand why picture 2 and picture 4 (see below) differ so much, although both represent an audio sample convolved with itself...

Any help is highly appreciated.

The following pictures are the result of the analysis using Excel:

  1. A longer audio sample with the reference audio (a beep) near the end:
  2. The beep convolved with itself:
  3. A longer audio sample without the beep convolved with the beep:
  4. The longer audio sample of point 3 convolved with itself:

UPDATE and solution:
Thanks to the extensive help of Han, I was able to achieve my goal.
After I rolled my own slow implementation without FFT, I found alglib which provides a fast implementation.
There is one basic assumption to my problem: One of the audio samples is contained completely within the other.
So, the following code returns the offset in samples in the larger of the two audio samples and the normalized cross-correlation value at that offset. 1 means complete correlation, 0 means no correlation at all and -1 means complete negative correlation:

private void CalcCrossCorrelation(IEnumerable<double> data1, 
                                  IEnumerable<double> data2, 
                                  out int offset, 
                                  out double maximumNormalizedCrossCorrelation)
{
    var data1Array = data1.ToArray();
    var data2Array = data2.ToArray();
    double[] result;
    alglib.corrr1d(data1Array, data1Array.Length, 
                   data2Array, data2Array.Length, out result);

    var max = double.MinValue;
    var index = 0;
    var i = 0;
    // Find the maximum cross correlation value and its index
    foreach (var d in result)
    {
        if (d > max)
        {
            index = i;
            max = d;
        }
        ++i;
    }
    // if the index is at or beyond the length of the first array, it has to be
    // interpreted as a negative index (a negative lag)
    if (index >= data1Array.Length)
    {
        index *= -1;
    }

    var matchingData1 = data1;
    var matchingData2 = data2;
    var biggerSequenceCount = Math.Max(data1Array.Length, data2Array.Length);
    var smallerSequenceCount = Math.Min(data1Array.Length, data2Array.Length);
    offset = index;
    if (index > 0)
        matchingData1 = data1.Skip(offset).Take(smallerSequenceCount).ToList();
    else if (index < 0)
    {
        offset = biggerSequenceCount + smallerSequenceCount + index;
        matchingData2 = data2.Skip(offset).Take(smallerSequenceCount).ToList();
        matchingData1 = data1.Take(smallerSequenceCount).ToList();
    }
    var mx = matchingData1.Average();
    var my = matchingData2.Average();
    var denom1 = Math.Sqrt(matchingData1.Sum(x => (x - mx) * (x - mx)));
    var denom2 = Math.Sqrt(matchingData2.Sum(y => (y - my) * (y - my)));
    maximumNormalizedCrossCorrelation = max / (denom1 * denom2);
}

BOUNTY:
No new answers required! I started the bounty to award it to Han for his continued effort with this question!

Comments (2)

染墨丶若流云 2024-11-11 11:58:15

Instead of convolution, you should use correlation. The size of the correlation peak tells you how much the two signals are alike; the position of the peak tells you their relative position in time, i.e. the delay between the two signals.

铃予 2024-11-11 11:58:15

Here we go for the bounty :)

To find a particular reference signal in a larger audio fragment, you need to use a cross-correlation algorithm. The basic formulae can be found in this Wikipedia article.

Cross-correlation is a process by which two signals are compared. This is done by multiplying both signals sample by sample and summing the results over all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals, such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero) or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure of how much both signals are alike and of the delay between them. Another article that may provide more insight can be found here.
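The multiply, sum and shift procedure described above can be sketched directly in the time domain. This is only a minimal illustration, not the ALGLIB routine and not the code from the question: the class name CrossCorrelationSketch and its methods are hypothetical, and the sketch assumes the pattern is no longer than the signal and only evaluates lags at which the pattern fits entirely inside the signal.

```csharp
using System;

public static class CrossCorrelationSketch
{
    // Raw (unnormalized) cross-correlation of a short pattern against a longer signal.
    // result[lag] = sum over j of signal[lag + j] * pattern[j],
    // computed for every lag at which the pattern fits entirely inside the signal.
    public static double[] Correlate(double[] signal, double[] pattern)
    {
        if (pattern.Length == 0 || pattern.Length > signal.Length)
            throw new ArgumentException("pattern must be non-empty and no longer than signal");

        var result = new double[signal.Length - pattern.Length + 1];
        for (int lag = 0; lag < result.Length; lag++)
        {
            double sum = 0.0;
            for (int j = 0; j < pattern.Length; j++)
                sum += signal[lag + j] * pattern[j];   // multiply and sum
            result[lag] = sum;                          // the next iteration shifts by one sample
        }
        return result;
    }

    // Index of the largest value, i.e. the lag at which the two signals agree best.
    public static int ArgMax(double[] values)
    {
        int best = 0;
        for (int i = 1; i < values.Length; i++)
            if (values[i] > values[best]) best = i;
        return best;
    }
}
```

With this lag convention, the index of the peak is the sample position in the signal where the pattern starts. For example, correlating the signal {0, 0, 0, 1, -1, 0, 0} with the pattern {1, -1} gives its largest value at index 3, which is where the pattern begins. The cost is proportional to the product of the two lengths, which is why FFT-based implementations are preferred for long recordings.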

This article by Paul Bourke also contains source code for a straightforward time-domain implementation. Note that the article is written for a general signal. Audio has the special property that the long-term average is usually 0. This means that the averages used in Paul Bourke's formula (mx and my) can be left out.
There are also fast implementations of the cross-correlation based on the FFT (see ALGLIB).
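For reference, the normalized cross-correlation at delay d is commonly written in the following form; the notation here (x, y, m_x, m_y, r(d)) is an assumption chosen to match the mx and my mentioned above, and should be checked against the article itself. The second line shows what remains when both means are taken to be 0, as is usually reasonable for audio.

```latex
r(d) = \frac{\sum_i (x_i - m_x)\,(y_{i-d} - m_y)}
            {\sqrt{\sum_i (x_i - m_x)^2}\;\sqrt{\sum_i (y_{i-d} - m_y)^2}}

% with m_x = m_y = 0:
r(d) \approx \frac{\sum_i x_i\, y_{i-d}}
                  {\sqrt{\sum_i x_i^2}\;\sqrt{\sum_i y_{i-d}^2}}
```

The numerator is the plain multiply-and-sum correlation; the denominator only rescales it so that the result lies between -1 and 1.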

The (maximum) value of the correlation depends on the sample values in the audio signals. In Paul Bourke's algorithm, however, the maximum is scaled to 1.0. In cases where one of the signals is contained entirely within the other, the maximum value will reach 1. In the more general case, the maximum will be lower, and a threshold value will have to be determined to decide whether the signals are sufficiently alike.
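A minimal sketch of the scaling and threshold idea follows. It is not Paul Bourke's code and not the ALGLIB-based solution from the question: the names NormalizedMatchSketch, ScoreAt and TryFind are hypothetical, and the sketch makes one explicit assumption, namely that each window of the signal is normalized on its own (a per-window Pearson coefficient), so that an embedded copy of the pattern scores 1 regardless of its volume.

```csharp
using System;

public static class NormalizedMatchSketch
{
    // Normalized correlation (between -1 and 1) of the pattern with the window
    // of the signal that starts at `start` and has the same length as the pattern.
    public static double ScoreAt(double[] signal, double[] pattern, int start)
    {
        int n = pattern.Length;
        double meanS = 0.0, meanP = 0.0;
        for (int j = 0; j < n; j++)
        {
            meanS += signal[start + j];
            meanP += pattern[j];
        }
        meanS /= n;
        meanP /= n;

        double num = 0.0, varS = 0.0, varP = 0.0;
        for (int j = 0; j < n; j++)
        {
            double ds = signal[start + j] - meanS;
            double dp = pattern[j] - meanP;
            num += ds * dp;
            varS += ds * ds;
            varP += dp * dp;
        }
        double denom = Math.Sqrt(varS * varP);
        return denom == 0.0 ? 0.0 : num / denom;   // a flat window carries no information
    }

    // Scans all window positions, reports the best one, and returns true
    // only if its score reaches the chosen threshold.
    public static bool TryFind(double[] signal, double[] pattern, double threshold,
                               out int offset, out double score)
    {
        offset = -1;
        score = double.NegativeInfinity;
        for (int start = 0; start + pattern.Length <= signal.Length; start++)
        {
            double s = ScoreAt(signal, pattern, start);
            if (s > score)
            {
                score = s;
                offset = start;
            }
        }
        return score >= threshold;
    }
}
```

Calling TryFind with a signal that contains the pattern unchanged (or merely louder or quieter) yields a score of 1 up to rounding error at the pattern's start position. A suitable threshold is not given by the method itself; any value such as 0.9 is an illustrative choice that has to be tuned against real recordings, and short patterns can reach high scores by chance.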
