How do I determine the magnitude and phase angle of frequencies from audio samples?

Posted on 2024-11-05 00:32:02

I'm currently working on a project that requires some DSP skills.
I must extract the audio from a movie and then analyze it to determine when someone is speaking, much like a voice activity detector.

I'm writing the code in Java (yes, I know it's not the best choice). The only libraries I use are one to extract the audio from the video and JLayer, so I can process MP3.

My class that extracts the audio samples gets the samples consecutively for each channel, in my case two: LEFT0, RIGHT0, LEFT1, RIGHT1, LEFT2, RIGHT2, etc.

So this is what I've done so far:

1. I put the samples for each channel in an array.
2. I apply a Hamming window [N = 8192]:
   double w = 0.54 - 0.46 * (Math.cos(2*Math.PI*buffer[i]/buffer.length-1));
   fftBuffer[i] = new Complex(w, 0);
3. I then perform a simple FFT on each channel and compute the magnitude: mag = re^2 + im^2. After that, I convert to a log scale (dB): mag_dB = 10 * log10(abs(mag)).
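For comparison, the textbook Hamming window is a function of the sample index rather than of the sample value, and it is multiplied into each sample before the FFT. The sketch below assumes that form. The class and method names are hypothetical, the FFT itself is left out, and the small floor inside the logarithm is an added assumption to avoid taking log10(0).

```java
/** Hypothetical helper illustrating the windowing and dB steps (not the asker's code). */
final class WindowSketch {

    /** Returns a copy of samples multiplied by a Hamming window of the same length (length > 1 assumed). */
    static double[] applyHamming(double[] samples) {
        int n = samples.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            // The window weight depends on the index i, not on the sample value.
            double w = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (n - 1));
            out[i] = w * samples[i];
        }
        return out;
    }

    /** Converts one FFT bin (re, im) to power in dB; the floor avoids log10(0). */
    static double powerDb(double re, double im) {
        double power = re * re + im * im;
        return 10.0 * Math.log10(Math.max(power, 1e-20));
    }
}
```

The windowed values would then go into the real part of the FFT input, for example `new Complex(windowed[i], 0)` with the `Complex` class from the post.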

Because I am "looking for voice" here, I need frequencies between 80 Hz and 1000 Hz (even though the voice ranges between 80 Hz and 255 Hz). So, from the FFT I get a mirrored array of N = 8192 values, of which I need only the first N/2.

The frequency per bin (a slot in the array produced by the FFT) would be the sample rate (48 kHz) / N; that would be 48000 / 8192 = 5 Hz per bin. So I only look in the array at the values from FFT_Result[15] to FFT_Result[199] (16 * 5 Hz = 80 Hz; 200 * 5 Hz = 1000 Hz), right?
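As a check on the arithmetic: 48000 / 8192 is about 5.86 Hz per bin rather than 5 Hz, which shifts the bin indices. A minimal sketch of the conversion follows, with hypothetical class and method names. With these numbers, 80 Hz lands near bin 14 and 1000 Hz near bin 171.

```java
/** Hypothetical helper for converting between frequencies and FFT bin indices. */
final class BinSketch {

    /** Width of one FFT bin in Hz. */
    static double binWidthHz(double sampleRate, int fftSize) {
        return sampleRate / fftSize;
    }

    /** Index of the FFT bin whose center frequency is nearest to freqHz. */
    static int binForFrequency(double freqHz, double sampleRate, int fftSize) {
        return (int) Math.round(freqHz * fftSize / sampleRate);
    }
}
```

Bin k is centered at k * sampleRate / fftSize, so bin 0 is DC and only bins up to fftSize / 2 carry distinct information for a real input.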

I took a look at the frequency analyzer in Cool Edit Pro, and all the amplitudes there are negative. In my case, the first ones (where the sound is in the background and isn't loud) are negative, and after that they are all positive. Aren't they supposed to be negative? Am I missing something here?

So far, based on what I've observed in the frequency analyzer and phase analyzer in Cool Edit Pro, I need a threshold on this frequency range, some kind of algorithm to determine whether the magnitude stays constant over that range for a period of n milliseconds, and a way to determine whether the sound is centered. The last one must be done (I think) by analyzing the phase angle: when someone speaks, the sound is always centered.

I haven't managed to find a way to do that, and I'm confused about what I've done so far because I don't know whether it is right.

So, if you've read all this, thank you for your patience. My questions are:
- Have I done this correctly so far?
- Does the amplitude have to be negative?
- Does anyone know how I can compute the phase for a number of samples?

醉生梦死 2024-11-12 00:32:02

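In dB, an amplitude can be negative as well as positive; that does not matter. What matters is the value relative to some threshold. I would base the threshold on the surrounding samples. Because the energy in spoken words rises and falls as syllables are pronounced, a simple average (multiplied by some arbitrary factor that you will have to experiment with to find what works best) could serve well as a threshold.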
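One way to read that suggestion is sketched below: each frame's energy is compared with the mean energy of its neighbours, scaled by a factor. The class name, the `halfWindow` and `factor` parameters, and the choice of linear (not dB) frame energies are all assumptions made for illustration; the values that work best would have to be found by experiment.

```java
/** Hypothetical sketch of a threshold derived from the surrounding frames. */
final class ThresholdSketch {

    /**
     * Marks frame i as active when its energy exceeds factor times the mean
     * energy of the frames within halfWindow positions on either side
     * (the frame itself is included in the mean).
     */
    static boolean[] detect(double[] frameEnergy, int halfWindow, double factor) {
        int n = frameEnergy.length;
        boolean[] active = new boolean[n];
        for (int i = 0; i < n; i++) {
            int lo = Math.max(0, i - halfWindow);
            int hi = Math.min(n - 1, i + halfWindow);
            double sum = 0.0;
            for (int j = lo; j <= hi; j++) {
                sum += frameEnergy[j];
            }
            double mean = sum / (hi - lo + 1);
            active[i] = frameEnergy[i] > factor * mean;
        }
        return active;
    }
}
```

For example, `ThresholdSketch.detect(energies, 3, 1.5)` flags only the frames that stand out by at least 50% from their local average.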

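For phase in the time domain, you could first take a Hilbert transform and then use atan2 on the real and imaginary parts of each sample to estimate the instantaneous phase.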
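A minimal sketch of that idea, under stated assumptions: the analytic signal is built in the frequency domain (negative frequencies zeroed, positive ones doubled), which is one common way to obtain the Hilbert transform. A naive O(N^2) DFT is used only to keep the example self-contained; a real FFT would replace it for N = 8192. The class and method names are hypothetical, and an even input length is assumed.

```java
/** Hypothetical sketch: instantaneous phase via the analytic signal. */
final class PhaseSketch {

    /** Returns the instantaneous phase in radians for each sample (even length assumed). */
    static double[] instantaneousPhase(double[] x) {
        int n = x.length;
        double[] re = new double[n];
        double[] im = new double[n];
        // Forward DFT of the real input (naive, for illustration only).
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double a = -2.0 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(a);
                im[k] += x[t] * Math.sin(a);
            }
        }
        // Keep DC and Nyquist, double positive frequencies, zero negative ones.
        for (int k = 1; k < n; k++) {
            double gain = (k < n / 2) ? 2.0 : (k == n / 2 ? 1.0 : 0.0);
            re[k] *= gain;
            im[k] *= gain;
        }
        // Inverse DFT gives the analytic signal z[t]; its angle is the phase.
        double[] phase = new double[n];
        for (int t = 0; t < n; t++) {
            double zr = 0.0;
            double zi = 0.0;
            for (int k = 0; k <= n / 2; k++) {
                double a = 2.0 * Math.PI * k * t / n;
                double c = Math.cos(a);
                double s = Math.sin(a);
                zr += re[k] * c - im[k] * s;
                zi += re[k] * s + im[k] * c;
            }
            phase[t] = Math.atan2(zi / n, zr / n);
        }
        return phase;
    }
}
```

If only the phase of a single FFT bin is needed, `Math.atan2(im, re)` applied to that bin gives it directly, without the Hilbert step.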

長街聽風 2024-11-12 00:32:02

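Instead of looking at the phase of the individual channels, you could check the delay between the two channels. Assuming the same signal is presented to both channels, the direction of the sound source can be found from this inter-channel delay. Assuming an ear-to-ear distance of about 20 cm, this delay is at most 0.2 / 340 ≈ 0.59 ms, or about 30 samples at 48 kHz. If you compute the cross-correlation over this range (30 samples), you should find a peak that indicates the direction of the source.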
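The cross-correlation search can be sketched as a direct loop over candidate lags. The class name, method name, and sign convention below are assumptions for illustration; a lag near zero would suggest a centered source.

```java
/** Hypothetical sketch: inter-channel delay from a brute-force cross-correlation. */
final class DelaySketch {

    /**
     * Returns the lag in samples, within [-maxLag, maxLag], that maximizes
     * the sum of left[t] * right[t + lag]. A positive result means the right
     * channel lags behind the left one.
     */
    static int bestLag(double[] left, double[] right, int maxLag) {
        int n = Math.min(left.length, right.length);
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            double score = 0.0;
            for (int t = 0; t < n; t++) {
                int u = t + lag;
                if (u >= 0 && u < n) {
                    score += left[t] * right[u];
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = lag;
            }
        }
        return best;
    }
}
```

With the figures from the answer, the call would be `DelaySketch.bestLag(left, right, 30)` on one frame of de-interleaved samples.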

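To detect the presence of a speech-like signal, you could compute the total energy in the 80-1000 Hz band and threshold it against some reasonable value. You can do this in the frequency domain by summing the magnitudes in the 80 to 1000 Hz range, or in the time domain with a band-pass filter followed by an RMS calculation.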
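A minimal sketch of the frequency-domain variant follows. It sums squared magnitudes (power) rather than plain magnitudes, which is a choice made here rather than something the answer specifies. The class and method names are hypothetical, and `re` and `im` are assumed to hold the full FFT output of one frame.

```java
/** Hypothetical sketch: energy of one FFT frame inside a frequency band. */
final class BandEnergySketch {

    /** Sums |X[k]|^2 over the bins whose center frequency lies in [lowHz, highHz]. */
    static double bandEnergy(double[] re, double[] im, double sampleRate,
                             double lowHz, double highHz) {
        int n = re.length; // FFT size
        int first = Math.max((int) Math.ceil(lowHz * n / sampleRate), 0);
        // Bins above n / 2 mirror the lower half for real input, so stop there.
        int last = Math.min((int) Math.floor(highHz * n / sampleRate), n / 2);
        double energy = 0.0;
        for (int k = first; k <= last; k++) {
            energy += re[k] * re[k] + im[k] * im[k];
        }
        return energy;
    }

    /** Simple decision rule; the threshold value has to be tuned on real data. */
    static boolean isSpeechLike(double energy, double threshold) {
        return energy > threshold;
    }
}
```

For the setup in the question this would be called as `BandEnergySketch.bandEnergy(re, im, 48000.0, 80.0, 1000.0)` once per frame and per channel.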