Note onset detection

Posted 2024-07-08 07:30:52

I am developing a system as an aid to musicians performing transcription. The aim is to perform automatic music transcription (it does not have to be perfect, as the user will correct glitches / mistakes later) on a single instrument monophonic recording. Does anyone here have experience in automatic music transcription? Or digital signal processing in general? Help from anyone is greatly appreciated no matter what your background.

So far I have investigated the use of the Fast Fourier Transform for pitch detection, and a number of tests in both MATLAB and my own Java test programs have shown it to be fast and accurate enough for my needs. Another element of the task that will need to be tackled is the display of the produced MIDI data in sheet music form, but this is something I am not concerned with right now.

In brief, what I am looking for is a good method for note onset detection, i.e. finding the position in the signal where a new note begins. As slow onsets can be quite difficult to detect properly, I will initially be using the system with piano recordings. This is also partly because I play piano and should be in a better position to obtain suitable recordings for testing. As stated above, early versions of this system will be used for simple monophonic recordings, possibly progressing later to more complex input depending on progress made in the coming weeks.

Comments (6)

平定天下 2024-07-15 07:30:53

Here is a graphic that illustrates the threshold approach to note onset detection:

[Image: waveform of three notes, with the threshold in red and the detected note starts in blue]

This image shows a typical WAV file with three discrete notes played in succession. The red line represents a chosen signal threshold, and the blue lines represent note start positions returned by a simple algorithm that marks a start when the signal level crosses the threshold.

As the image shows, selecting a proper absolute threshold is difficult. In this case, the first note is picked up fine, the second note is missed completely, and the third note is picked up (barely), and very late. In general, a low threshold causes you to pick up phantom notes, while raising it causes you to miss notes. One solution to this problem is to use a relative threshold that triggers a start if the signal increases by a certain percentage over a certain time, but this has problems of its own.
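For illustration only, the relative-threshold idea could be sketched roughly as below, in Java since the question mentions Java test programs. The class name, the parameter names (`window`, `lookback`, `ratio`, `floor`) and the re-arming rule are assumptions of this sketch, not something specified in the answer. The `floor` parameter keeps near-silent passages from triggering on tiny level changes.

```java
import java.util.ArrayList;
import java.util.List;

final class RelativeThresholdOnsets {
    /**
     * Returns sample indices where the smoothed level has risen to at least
     * `ratio` times (e.g. 2.0 = doubled) its value `lookback` samples earlier.
     * Levels below `floor` are treated as `floor` when used as the reference.
     */
    static List<Integer> detect(short[] samples, int window, int lookback,
                                double ratio, double floor) {
        // Smoothed level: mean absolute amplitude of the last `window` samples.
        double[] level = new double[samples.length];
        long sum = 0;
        for (int i = 0; i < samples.length; i++) {
            sum += Math.abs((int) samples[i]);
            if (i >= window) {
                sum -= Math.abs((int) samples[i - window]);
            }
            level[i] = (double) sum / Math.min(i + 1, window);
        }

        List<Integer> onsets = new ArrayList<>();
        boolean armed = true; // re-armed once the rise condition stops holding
        for (int i = lookback; i < samples.length; i++) {
            double before = Math.max(level[i - lookback], floor);
            boolean rising = level[i] >= ratio * before;
            if (rising && armed) {
                onsets.add(i); // first sample at which the rise is detected
                armed = false;
            } else if (!rising) {
                armed = true;
            }
        }
        return onsets;
    }
}
```

The reported index is where the rise is first detected, so it lags the true start by a few samples. The "problems of its own" still apply: `ratio`, `lookback` and `floor` all need tuning per recording.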

A simpler solution is to use the somewhat-counterintuitively named compression (not MP3 compression - that's something else entirely) on your wave file first. Compression essentially flattens the spikes in your audio data and then amplifies everything so that more of the audio is near the maximum values. The effect on the above sample would look like this (which shows why the name "compression" appears to make no sense - on audio equipment it's usually labelled "loudness"):

[Image: the same waveform after compression]

After compression, the absolute threshold approach will work much better (although it's easy to over-compress and start picking up fictional note starts, the same effect as lowering the threshold). There are a lot of wave editors out there that do a good job of compression, and it's better to let them handle this task - you'll probably need to do a fair amount of work "cleaning up" your wave files before detecting notes in them anyway.

In coding terms, a 16-bit WAV file loaded into memory is essentially just an array of two-byte integers, where 0 represents no signal and 32,767 and -32,768 represent the peaks. In its simplest form, a threshold detection algorithm would just start at the first sample and read through the array until it finds a value whose magnitude is greater than the threshold.

short threshold = 10000;
for (int i = 0; i < samples.Length; i++)
{
    // widen to int first: Math.Abs on a short equal to -32768 throws OverflowException
    if (Math.Abs((int)samples[i]) > threshold)
    {
        // here is one note onset point
    }
}

In practice this works horribly, since normal audio has all sorts of transient spikes above a given threshold. One solution is to use a running average signal strength (i.e. don't mark a start until the average of the last n samples is above the threshold).

short threshold = 10000;
int window_length = 100;
int running_total = 0;
// tally up the absolute values of the first window_length samples
// (raw signed samples would average out to roughly zero)
for (int i = 0; i < window_length && i < samples.Length; i++)
{
    running_total += Math.Abs((int)samples[i]);
}
// calculate moving average of the rectified signal
for (int i = window_length; i < samples.Length; i++)
{
    // remove oldest sample and add current
    running_total -= Math.Abs((int)samples[i - window_length]);
    running_total += Math.Abs((int)samples[i]);
    int moving_average = running_total / window_length;
    if (moving_average > threshold)
    {
        // here is one note onset point (taken as the centre of the window)
        int onset_point = i - (window_length / 2);
    }
}

All of this requires much tweaking and playing around with settings to get it to find the start positions of a WAV file accurately, and usually what works for one file will not work very well on another. This is a very difficult and not-perfectly-solved problem domain you've chosen, but I think it's cool that you're tackling it.

Update: this graphic shows a detail of note detection I left out, namely detecting when the note ends:

[Image: the waveform with the off-threshold in yellow and the detected note ends in purple]

The yellow line represents the off-threshold. Once the algorithm has detected a note start, it assumes the note continues until the running average signal strength drops below this value (shown here by the purple lines). This is, of course, another source of difficulties, as is the case where two or more notes overlap (polyphony).

Once you've detected the start and stop points of each note, you can now analyze each slice of WAV file data to determine the pitches.
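As a rough sketch of the start/stop logic described above (again in Java, to match the asker's test programs): the running average of the absolute signal must exceed an on-threshold to open a note, and must then fall below a lower off-threshold to close it. The class and parameter names are made up for this example, and treating a note that is still sounding at the end of the file as ending there is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class NoteSegmenter {
    /** Returns {start, end} sample index pairs; end is exclusive. */
    static List<int[]> segment(short[] samples, int window,
                               int onThreshold, int offThreshold) {
        List<int[]> notes = new ArrayList<>();
        long sum = 0;
        int start = -1; // -1 means "no note currently sounding"
        for (int i = 0; i < samples.length; i++) {
            sum += Math.abs((int) samples[i]);
            if (i >= window) {
                sum -= Math.abs((int) samples[i - window]);
            }
            if (i < window - 1) {
                continue; // wait until a full window is available
            }
            long average = sum / window;
            if (start < 0 && average > onThreshold) {
                start = i; // note start
            } else if (start >= 0 && average < offThreshold) {
                notes.add(new int[] {start, i}); // note end
                start = -1;
            }
        }
        if (start >= 0) {
            // assumption: a note still sounding at the end of the data ends there
            notes.add(new int[] {start, samples.length});
        }
        return notes;
    }
}
```

Each returned pair can then be used to cut a slice out of the original (uncompressed) sample array for pitch analysis.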

Update 2: I just read your updated question. Pitch-detection through auto-correlation is much easier to implement than FFT if you're writing your own from scratch, but if you've already checked out and used a pre-built FFT library, you're better off using it for sure. Once you've identified the start and stop positions of each note (and included some padding at the beginning and end for the missed attack and release portions), you can now pull out each slice of audio data and pass it to an FFT function to determine the pitch.

One important point here is not to use a slice of the compressed audio data, but rather to use a slice of the original, unmodified data. The compression process distorts the audio and may produce an inaccurate pitch reading.

One last point about note attack times is that it may be less of a problem than you think. Often in music an instrument with a slow attack (like a soft synth) will begin a note earlier than a sharp attack instrument (like a piano) and both notes will sound as if they're starting at the same time. If you're playing instruments in this manner, the algorithm will pick up the same start time for both kinds of instruments, which is good from a WAV-to-MIDI perspective.

Last update (I hope): Forget what I said about including some padding samples from the early attack part of each note - I forgot this is actually a bad idea for pitch detection. The attack portions of many instruments (especially piano and other percussive-type instruments) contain transients that aren't multiples of the fundamental pitch, and will tend to screw up pitch detection. You actually want to start each slice a little after the attack for this reason.
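For anyone who wants to try the auto-correlation route mentioned in Update 2, here is a minimal sketch in Java. It is one common variant, not necessarily what the answer had in mind: it returns the first lag whose correlation is a local maximum and reaches 90% of the zero-lag energy. The 0.9 factor, the class name and the `minHz`/`maxHz` search range are assumptions of this example. The slice passed in should start a little after the attack, as noted above.

```java
final class PitchEstimator {
    /** Sum of x[i] * x[i + lag] over a fixed-length window. */
    private static double correlate(double[] x, int lag, int window) {
        double sum = 0;
        for (int i = 0; i < window; i++) {
            sum += x[i] * x[i + lag];
        }
        return sum;
    }

    /**
     * Estimates the fundamental frequency in Hz of a monophonic slice,
     * or returns NaN if no sufficiently strong periodicity is found.
     */
    static double estimate(double[] x, double sampleRate, double minHz, double maxHz) {
        int minLag = (int) Math.floor(sampleRate / maxHz);
        int maxLag = (int) Math.ceil(sampleRate / minHz);
        int window = x.length - maxLag - 1; // keeps x[i + lag] in bounds
        if (minLag < 1 || window <= 0) {
            throw new IllegalArgumentException("slice too short for this frequency range");
        }
        double energy = correlate(x, 0, window);
        if (energy <= 0) {
            return Double.NaN; // silent slice
        }
        double prev = correlate(x, minLag - 1, window);
        double cur = correlate(x, minLag, window);
        for (int lag = minLag; lag <= maxLag; lag++) {
            double next = correlate(x, lag + 1, window);
            // first strong local maximum = one period of the fundamental
            if (cur >= 0.9 * energy && cur > prev && cur >= next) {
                return sampleRate / lag;
            }
            prev = cur;
            cur = next;
        }
        return Double.NaN;
    }
}
```

Because the lag is a whole number of samples, the result is quantized (a 440 Hz tone at 44.1 kHz comes out as 441 Hz); interpolating around the peak would refine it, but that is beyond this sketch.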

Oh, and kind of important: the term "compression" here does not refer to MP3-style compression.

Update again: here is a simple function that does non-dynamic compression:

public void StaticCompress(short[] samples, float param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        int sign = (samples[i] < 0) ? -1 : 1;
        // divide by 32768.0 (NOT short.MaxValue); an integer divide would give 0
        float norm = ABS(samples[i] / 32768.0);
        norm = 1.0 - POW(1.0 - norm, param);
        samples[i] = (short)(32768 * norm * sign);
    }
}

When param = 1.0, this function will have no effect on the audio. Larger param values (2.0 is good, which will square the normalized difference between each sample and the max peak value) will produce more compression and a louder overall (but crappy) sound. Values under 1.0 will produce an expansion effect.

One other probably obvious point: you should record the music in a small, non-echoic room since echoes are often picked up by this algorithm as phantom notes.

Update: here is a version of StaticCompress that will compile in C# and explicitly casts everything. This returns the expected result:

public void StaticCompress(short[] samples, double param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        Compress(ref samples[i], param);
    }
}

public void Compress(ref short orig, double param)
{
    double sign = 1;
    if (orig < 0)
    {
        sign = -1;
    }
    // 32768 is max abs value of a short. best practice is to pre-
    // normalize data or use peak value in place of 32768
    double norm = Math.Abs((double)orig / 32768.0);
    norm = 1.0 - Math.Pow(1.0 - norm, param);
    orig = (short)(32768.0 * norm * sign); // should round before cast,
        // but won't affect note onset detection
}

Sorry, my knowledge score on Matlab is 0. If you posted another question on why your Matlab function doesn't work as expected it would get answered (just not by me).

在巴黎塔顶看东京樱花 2024-07-15 07:30:53

What you want to do is often called WAV-to-MIDI (google "wav-to-midi"). There have been many attempts at this process, with varying results (note onset is one of the difficulties; polyphony is much harder to deal with). I'd recommend starting with a thorough search of the off-the-shelf solutions, and only start work on your own if there's nothing acceptable out there.

The other part of the process you'd need is something to render the MIDI output as a traditional musical score, but there are umpteen billion products that do that.

Another answer is: yes, I've done a lot of digital signal processing (see the software on my website - it's an infinite-voice software synthesizer written in VB and C), and I'm interested in helping you with this problem. The WAV-to-MIDI part isn't really that difficult conceptually, it's just making it work reliably in practice that's hard. Note onset is just setting a threshold - errors can be easily adjusted forward or backward in time to compensate for note attack differences. Pitch detection is much easier to do on a recording than it is to do in real time, and involves just implementing an auto-correlation routine.

天暗了我发光 2024-07-15 07:30:53

You should look at MIRToolbox - it is written for Matlab, and has an onset detector built in - it works pretty well. The source code is GPL'd, so you can implement the algorithm in whatever language works for you. What language is your production code going to use?

書生途 2024-07-15 07:30:53

This library is centered around audio labeling:

aubio

aubio is a library for audio labelling. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing midi streams from live audio. The name aubio comes from 'audio' with a typo: several transcription errors are likely to be found in the results too.

I have had good luck with it for onset detection and pitch detection. It's written in C, but there are SWIG/Python wrappers.

Also, the author of the library has a PDF of his thesis on the page, which has great info and background about labeling.

森林迷了鹿 2024-07-15 07:30:53

Hard onsets are easily detected in the time domain by using an average energy measurement.

E = (1/N) * SUM for n = 0 to N-1 of (x[n]^2)

Do this with chunks of the entire signal. You should see peaks when onsets occur (the window size is up to you, my suggestion is 50ms or more).
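A small Java sketch of this chunked energy measurement might look like the following. The class name, the non-overlapping frames, and the "energy jumped by `ratio` compared with the previous frame" rule for picking peaks are assumptions of the example. At 44.1 kHz, a 50 ms chunk would be a `frameSize` of 2205 samples.

```java
import java.util.ArrayList;
import java.util.List;

final class EnergyOnsets {
    /** Mean of x^2 over consecutive, non-overlapping frames of frameSize samples. */
    static double[] frameEnergies(short[] samples, int frameSize) {
        int frames = samples.length / frameSize; // a trailing partial frame is ignored
        double[] energy = new double[frames];
        for (int f = 0; f < frames; f++) {
            double sum = 0;
            for (int i = f * frameSize; i < (f + 1) * frameSize; i++) {
                sum += (double) samples[i] * samples[i];
            }
            energy[f] = sum / frameSize;
        }
        return energy;
    }

    /**
     * Flags frames whose energy is at least `ratio` times the previous
     * frame's energy. Previous energies below `floor` count as `floor`,
     * so background noise after silence does not trigger.
     */
    static List<Integer> onsetFrames(double[] energy, double ratio, double floor) {
        List<Integer> onsets = new ArrayList<>();
        for (int f = 1; f < energy.length; f++) {
            if (energy[f] >= ratio * Math.max(energy[f - 1], floor)) {
                onsets.add(f);
            }
        }
        return onsets;
    }
}
```

Multiplying a flagged frame index by `frameSize` gives the approximate onset position in samples; the time resolution is limited to one frame.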

Extensive Papers on Onset Detection:

For Hardcore Engineers:

http://www.nyu.edu/classes/bello/MIR_files/2005_BelloEtAl_IEEE_TSALP.pdf

Easier for average person to understand:

https://adamhess.github.io/Onset_Detection_Nov302011.pdf

千年*琉璃梦 2024-07-15 07:30:53

You could try to transform the WAV signal into a graph of amplitude against time. Then a way to determine a consistent onset is to calculate where the tangent at the inflection point of the signal's rising edge intersects the x axis.
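One possible reading of this, sketched in Java: take an amplitude envelope covering a single rising edge, treat the point of steepest slope as the inflection point, and extend the tangent there back to zero amplitude. The class name and the use of a central difference for the slope are assumptions of the sketch. How the envelope itself is obtained (for example a running average of the absolute signal) is left open.

```java
final class TangentOnset {
    /**
     * envelope[i] is the amplitude envelope at sample i, covering one rising edge.
     * Returns the (fractional) sample index where the tangent at the steepest
     * point crosses zero amplitude, or NaN if the envelope never rises.
     */
    static double estimate(double[] envelope) {
        if (envelope.length < 3) {
            throw new IllegalArgumentException("need at least 3 envelope points");
        }
        int steepest = 1;
        double maxSlope = Double.NEGATIVE_INFINITY;
        for (int i = 1; i < envelope.length - 1; i++) {
            // central difference approximates the slope at sample i
            double slope = (envelope[i + 1] - envelope[i - 1]) / 2.0;
            if (slope > maxSlope) {
                maxSlope = slope;
                steepest = i;
            }
        }
        if (maxSlope <= 0) {
            return Double.NaN; // no rising edge in this stretch
        }
        // Tangent: y = envelope[steepest] + maxSlope * (t - steepest); solve y = 0 for t.
        return steepest - envelope[steepest] / maxSlope;
    }
}
```

Because the estimate depends on the slope rather than on an absolute level, it should shift less with recording volume than a fixed threshold would.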
