How do I determine the magnitude and phase angle of frequencies from audio samples?
I'm currently working on a project that requires some DSP skills.
I must extract the audio from a movie and then, by analyzing it, determine when someone is speaking and when not; in other words, something like a voice activity detector.
I'm writing the code in Java (yes, I know it's not the best choice). The only libraries I use are one to extract the audio from the video and JLayer, so that I can process an MP3.
My class that extracts the audio samples returns them interleaved per channel, in my case two channels: LEFT0, RIGHT0, LEFT1, RIGHT1, LEFT2, RIGHT2, etc.
So this is what I've done so far:
- I put the samples for each channel in an array.
- I apply a Hamming window [N = 8192]:
double w = 0.54 - 0.46 * (Math.cos(2*Math.PI*buffer[i]/buffer.length-1));
fftBuffer[i] = new Complex(w, 0);
- I then perform a simple FFT on each channel and then compute the magnitude
mag = re^2 + im^2;
After that, I convert to a log scale (dB): mag_dB = 10 * log10(abs(mag));
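For comparison, here is a minimal sketch of the windowing and dB steps as they are usually written. It is not the poster's code: the class and method names are made up for illustration. Two things differ from the snippet above. The cosine takes the sample index `i` divided by `(N - 1)` rather than the sample value, and the window coefficient multiplies the sample before it goes into the FFT buffer. The squares are written as `re * re` because `^` is bitwise XOR in Java.

```java
// Illustrative sketch; WindowedSpectrum and its methods are hypothetical names.
final class WindowedSpectrum {

    /** Hamming coefficient for sample index i of an n-sample frame. */
    static double hamming(int i, int n) {
        return 0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (n - 1));
    }

    /** Returns a copy of the frame with every sample scaled by its window coefficient. */
    static double[] applyHamming(double[] frame) {
        double[] out = new double[frame.length];
        for (int i = 0; i < frame.length; i++) {
            out[i] = frame[i] * hamming(i, frame.length);
        }
        return out;
    }

    /** Power of one FFT bin in dB; the small floor avoids log10(0). */
    static double powerDb(double re, double im) {
        double power = re * re + im * im;
        return 10.0 * Math.log10(Math.max(power, 1e-12));
    }
}
```

The windowed values would then be fed to the FFT as the real parts, for example `new Complex(windowed[i], 0)` with the poster's `Complex` class.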
Because I am "looking for voice" here, I need frequencies between 80 and 1000 Hz (even though the voice ranges between 80 Hz and 255 Hz). From the FFT I get a mirrored array of N = 8192 values, of which I need only the first N/2.
The frequency per bin (a slot in the array produced by the FFT) would be the sample rate (48 kHz) / N; that would be 48000 / 8192 = 5 Hz per bin. So I only look in the array at the values from FFT_Result[15] to FFT_Result[199] (16 * 5 Hz = 80 Hz; 200 * 5 Hz = 1000 Hz), right?!
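As a quick check of that arithmetic: 48000 / 8192 is 5.859375 Hz per bin rather than exactly 5 Hz, so the bin indices shift somewhat. The sketch below assumes the usual zero-based layout where bin k sits at k * sampleRate / N; the class and method names are illustrative only.

```java
// Illustrative sketch; BinMath is a hypothetical helper, not part of any library.
final class BinMath {

    /** Width of one FFT bin in Hz. */
    static double binWidthHz(double sampleRate, int fftSize) {
        return sampleRate / fftSize;
    }

    /** Index of the first bin whose centre frequency is at or above hz. */
    static int firstBinAtOrAbove(double hz, double sampleRate, int fftSize) {
        return (int) Math.ceil(hz * fftSize / sampleRate);
    }

    /** Index of the last bin whose centre frequency is at or below hz. */
    static int lastBinAtOrBelow(double hz, double sampleRate, int fftSize) {
        return (int) Math.floor(hz * fftSize / sampleRate);
    }
}
```

With a 48000 Hz sample rate and N = 8192, this gives bins 14 through 170 for the 80 to 1000 Hz range.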
I took a look at the frequency analyzer in Cool Edit Pro, and all the amplitudes there are negative. In my case, the first ones (where the sound is in the background and isn't loud) are negative, and after that they are all positive. Aren't they supposed to be negative? Am I missing something here?
So far, based on what I've observed in the frequency analyzer and phase analyzer in Cool Edit Pro, I need a threshold on this frequency range, plus some kind of algorithm that determines, over a period of n milliseconds, whether the magnitude is constant over that frequency range and whether the sound is centered. The last part must be done (I think) by analyzing the phase angle: when someone speaks, the sound is always centered.
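One likely reason for the sign difference is the reference level. Audio editors commonly plot levels relative to full scale (dBFS), where 0 dB is the loudest representable sine and everything else is negative, while a raw FFT of integer samples has no such normalization. The sketch below is an assumption about how such a scale could be reproduced, not a description of Cool Edit Pro's internals. It takes the bin magnitude sqrt(re*re + im*im) (not the squared value), the sum of the window coefficients, and the full-scale sample value (32768 for 16-bit audio); all names are hypothetical.

```java
// Illustrative sketch; DbFullScale is a hypothetical helper.
final class DbFullScale {

    /**
     * Converts an FFT bin magnitude to dB relative to full scale.
     * A sine of amplitude A centred on a bin yields a magnitude of about A * windowSum / 2,
     * so 2 * magnitude / windowSum recovers A.
     */
    static double toDbfs(double magnitude, double windowSum, double fullScale) {
        double amplitude = 2.0 * magnitude / windowSum;
        return 20.0 * Math.log10(Math.max(amplitude / fullScale, 1e-12));
    }
}
```

With this normalization a full-scale sine comes out near 0 dB and quieter content is negative, which matches the kind of display described above.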
I didn't manage to find a way to do that and I'm all confused with what I've done so far because I do not know if what I've done so far is right.
So, if you read all this, thank you for your patience and my questions are:
- Is what I've done so far correct?
- Does the amplitude have to be negative?
- Does anyone know how I can compute the phase for a number of samples?
Comments (3)
In dB, the amplitude can be negative or positive; it doesn't matter. What matters is the value relative to some threshold. I would base the threshold on the surrounding samples. Because the energy in spoken words goes up and down as syllables are spoken, a simple average (multiplied by some arbitrary factor that you'll have to tune to find what works well) would work fine as a threshold.
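A minimal sketch of that idea, assuming one energy value per analysis frame is already available: each frame is compared against the mean energy of its neighbours multiplied by a tunable factor. The names `AdaptiveThreshold`, `halfWindow` and `factor` are illustrative, and the values used for them would need experimentation.

```java
// Illustrative sketch; AdaptiveThreshold is a hypothetical helper.
final class AdaptiveThreshold {

    /**
     * Marks frame i as active when its energy exceeds factor times the mean
     * energy of the frames within halfWindow positions on either side
     * (the window is clipped at the ends of the array).
     */
    static boolean[] detect(double[] frameEnergy, int halfWindow, double factor) {
        int n = frameEnergy.length;
        boolean[] active = new boolean[n];
        for (int i = 0; i < n; i++) {
            int from = Math.max(0, i - halfWindow);
            int to = Math.min(n - 1, i + halfWindow);
            double sum = 0.0;
            for (int j = from; j <= to; j++) {
                sum += frameEnergy[j];
            }
            double mean = sum / (to - from + 1);
            active[i] = frameEnergy[i] > factor * mean;
        }
        return active;
    }
}
```

For example, `AdaptiveThreshold.detect(energies, 3, 1.5)` flags a frame only if it is at least 50% louder than the local average.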
For phase in the time domain, you can first take a Hilbert transform, and then use atan2 on the real and imaginary parts of each sample to estimate instantaneous phase.
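One way to sketch this without an external library is to build the analytic signal in the frequency domain: transform the frame, zero the negative-frequency bins, double the positive ones, transform back, and take `atan2` of the imaginary and real parts. The code below uses a naive O(n²) DFT purely to stay self-contained, so it is only practical for short frames; a real FFT would replace both loops. The class name is hypothetical.

```java
// Illustrative sketch; InstantaneousPhase is a hypothetical helper using a naive DFT.
final class InstantaneousPhase {

    /** Returns the instantaneous phase (radians, in [-pi, pi]) of each sample of x. */
    static double[] phase(double[] x) {
        int n = x.length;
        double[] re = new double[n];
        double[] im = new double[n];
        // Forward DFT of the real input.
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(angle);
                im[k] += x[t] * Math.sin(angle);
            }
        }
        // Keep DC (and Nyquist for even n), double positive frequencies, zero the rest.
        for (int k = 0; k < n; k++) {
            double weight;
            if (k == 0 || (n % 2 == 0 && k == n / 2)) {
                weight = 1.0;
            } else if (k < (n + 1) / 2) {
                weight = 2.0;
            } else {
                weight = 0.0;
            }
            re[k] *= weight;
            im[k] *= weight;
        }
        // Inverse DFT gives the analytic signal; its angle is the instantaneous phase.
        double[] result = new double[n];
        for (int t = 0; t < n; t++) {
            double zr = 0.0;
            double zi = 0.0;
            for (int k = 0; k < n; k++) {
                double angle = 2.0 * Math.PI * k * t / n;
                double c = Math.cos(angle);
                double s = Math.sin(angle);
                zr += re[k] * c - im[k] * s;
                zi += re[k] * s + im[k] * c;
            }
            result[t] = Math.atan2(zi / n, zr / n);
        }
        return result;
    }
}
```

The real part of the analytic signal is the original sample and the imaginary part is its Hilbert transform, which is why `atan2(imaginary, real)` yields the phase described above.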
Instead of looking at the phase of the individual channels, you could check the delay between both channels. Assuming that the same signal is presented to both channels, the direction of the sound source can be found from this inter-channel delay. Assuming an ear-to-ear distance of some 20 cm and a speed of sound of 340 m/s, this delay is at most 0.2 / 340 ≈ 0.59 ms, or some 30 samples at 48 kHz. If you calculate the cross-correlation over this range (30 samples), you should find a peak indicating the source direction.
To find the presence of a voice-like signal, you could calculate the total energy in the 80-1000 Hz band and threshold it against some reasonable value. You can do this either in the frequency domain, by summing the magnitudes in the bins from 80 to 1000 Hz, or in the time domain, using a band-pass filter and an RMS value calculation.
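A direct (non-FFT) cross-correlation over such a small lag range is cheap enough to write out by hand. The sketch below returns the lag with the highest correlation; a result of 0 would suggest a centered source. The class name and sign convention (positive lag means the right channel trails the left) are choices made for this illustration.

```java
// Illustrative sketch; ChannelDelay is a hypothetical helper.
final class ChannelDelay {

    /**
     * Returns the lag in [-maxLag, maxLag] that maximises sum(left[i] * right[i + lag]).
     * A positive result means the right channel lags behind the left one.
     */
    static int bestLag(double[] left, double[] right, int maxLag) {
        int n = Math.min(left.length, right.length);
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                int j = i + lag;
                if (j >= 0 && j < n) {
                    sum += left[i] * right[j];
                }
            }
            if (sum > bestScore) {
                bestScore = sum;
                best = lag;
            }
        }
        return best;
    }
}
```

Called as `ChannelDelay.bestLag(left, right, 30)` on one analysis frame, it covers the range of delays estimated above.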
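The frequency-domain variant can be sketched as follows, assuming the FFT output is available as two arrays of length N with bin k at k * sampleRate / N. This version sums squared magnitudes (energy) rather than plain magnitudes, which is one reasonable reading of the suggestion; the class name and the threshold parameter are illustrative.

```java
// Illustrative sketch; BandEnergy is a hypothetical helper.
final class BandEnergy {

    /** Sums re^2 + im^2 over all bins whose centre frequency lies in [lowHz, highHz]. */
    static double bandEnergy(double[] re, double[] im, double sampleRate,
                             double lowHz, double highHz) {
        int n = re.length; // FFT size
        int first = (int) Math.ceil(lowHz * n / sampleRate);
        int last = Math.min((int) Math.floor(highHz * n / sampleRate), n / 2);
        double sum = 0.0;
        for (int k = first; k <= last; k++) {
            sum += re[k] * re[k] + im[k] * im[k];
        }
        return sum;
    }

    /** True when the band energy exceeds a threshold that has to be tuned by experiment. */
    static boolean isVoiceLike(double[] re, double[] im, double sampleRate, double threshold) {
        return bandEnergy(re, im, sampleRate, 80.0, 1000.0) > threshold;
    }
}
```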
You have a double-sided transform. The midpoint is the DC component. A negative frequency is really a positive frequency that is 180 degrees out of phase! So, if you use the first half of the FFT values, with negative frequencies, you need to change the phase by pi to have an accurate picture of what is happening.
Alternatively, use the second half of the FFT values, where the frequencies are positive and the phases are correct.
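Which half holds which frequencies depends on the output layout of the FFT routine in use, so it may be worth verifying on a small input. The sketch below is a naive DFT, written out only so the layout can be inspected without a library; the class name is hypothetical. In this unshifted layout, index 0 is the DC term, indices 1 to N/2 - 1 are positive frequencies, and the upper half mirrors them. For real-valued input, bin N - k is the complex conjugate of bin k, so its phase is the negative of the phase of bin k.

```java
// Illustrative sketch; DftLayout is a hypothetical helper using a naive O(n^2) DFT.
final class DftLayout {

    /** Returns {re, im} of the unshifted DFT of a real signal (index 0 = DC). */
    static double[][] dft(double[] x) {
        int n = x.length;
        double[] re = new double[n];
        double[] im = new double[n];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(angle);
                im[k] += x[t] * Math.sin(angle);
            }
        }
        return new double[][] {re, im};
    }

    /** Phase angle of bin k in radians. */
    static double phase(double[][] spectrum, int k) {
        return Math.atan2(spectrum[1][k], spectrum[0][k]);
    }
}
```

Comparing `phase(spectrum, k)` with `phase(spectrum, n - k)` on a test signal shows how the two halves relate for a given layout.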