Audio signal-processing onset-detection

Note onset detection

Posted on 2024-07-08 07:30:52

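I am developing a system to help musicians with transcription. The aim is to perform automatic music transcription on a monophonic recording of a single instrument (it does not have to be perfect, since the user will correct glitches and mistakes afterwards). Does anyone here have experience with automatic music transcription, or with digital signal processing in general? Help from anyone is greatly appreciated, whatever your background.

So far I have investigated using the Fast Fourier Transform for pitch detection, and a number of tests in both MATLAB and my own Java test programs have shown it to be fast and accurate enough for my needs. Another element of the task that will need tackling is displaying the resulting MIDI data as sheet music, but that is not something I am concerned with right now.

In brief, what I am looking for is a good method for note onset detection, i.e. finding the position in the signal where a new note begins. Since slow onsets can be hard to detect properly, I will initially use the system with piano recordings. This is also partly because I play the piano and should be in a better position to obtain suitable recordings for testing. As stated above, early versions of the system will be used for simple monophonic recordings, possibly progressing to more complex input depending on the progress made in the coming weeks.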

平定天下 2024-07-15 07:30:53

Here is a graphic that illustrates the threshold approach to note onset detection:

[Image: waveform of three notes, with the threshold in red and the detected note starts in blue]

This image shows a typical WAV file with three discrete notes played in succession. The red line represents a chosen signal threshold, and the blue lines represent note start positions returned by a simple algorithm that marks a start when the signal level crosses the threshold.

As the image shows, selecting a proper absolute threshold is difficult. In this case, the first note is picked up fine, the second note is missed completely, and the third note is (barely) detected, and very late. In general, a low threshold causes you to pick up phantom notes, while raising it causes you to miss notes. One solution to this problem is to use a relative threshold that triggers a start if the signal increases by a certain percentage over a certain time, but this has problems of its own.
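For what it's worth, here is a rough sketch of the relative-threshold idea in Java (the question mentions Java test programs, and it ports to C# almost line for line). The class name, the fixed non-overlapping windows, and the `ratio` and `floor` parameters are all assumptions made for illustration, not something taken from the answer above. It compares the mean absolute level of each window with the previous one and flags an onset when the level has grown by more than `ratio`.

```java
import java.util.ArrayList;
import java.util.List;

final class RelativeThresholdOnsets {
    // Mean absolute level of samples[start .. start + length).
    static double level(short[] samples, int start, int length) {
        long sum = 0;
        for (int i = start; i < start + length; i++) {
            sum += Math.abs((int) samples[i]);
        }
        return (double) sum / length;
    }

    // Returns the start index of every window whose level exceeds the
    // previous window's level by more than `ratio` (2.0 means "doubled").
    // `floor` is a hypothetical minimum level that keeps low-level noise
    // from triggering onsets.
    static List<Integer> detect(short[] samples, int window, double ratio, double floor) {
        List<Integer> onsets = new ArrayList<>();
        if (samples.length < window) {
            return onsets;
        }
        double previous = level(samples, 0, window);
        for (int start = window; start + window <= samples.length; start += window) {
            double current = level(samples, start, window);
            if (current > floor && current > previous * ratio) {
                onsets.add(start);
            }
            previous = current;
        }
        return onsets;
    }
}
```

One of the "problems of its own" shows up immediately: the result is only as precise as the window length, and a note that starts inside the very first window has nothing to be compared against.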

A simpler solution is to use the somewhat-counterintuitively named compression (not MP3 compression - that's something else entirely) on your wave file first. Compression essentially flattens the spikes in your audio data and then amplifies everything so that more of the audio is near the maximum values. The effect on the above sample would look like this (which shows why the name "compression" appears to make no sense - on audio equipment it's usually labelled "loudness"):

[Image: the same waveform after compression]

After compression, the absolute threshold approach will work much better (although it's easy to over-compress and start picking up fictional note starts, the same effect as lowering the threshold). There are a lot of wave editors out there that do a good job of compression, and it's better to let them handle this task - you'll probably need to do a fair amount of work "cleaning up" your wave files before detecting notes in them anyway.

In coding terms, a WAV file loaded into memory is essentially just an array of two-byte integers, where 0 represents no signal and 32,767 and -32,768 represent the peaks. In its simplest form, a threshold detection algorithm would just start at the first sample and read through the array until it finds a value greater than the threshold.

short threshold = 10000;
for (int i = 0; i < samples.Length; i++)
{
    // widen to int first: Math.Abs(short.MinValue) throws an OverflowException
    if (Math.Abs((int)samples[i]) > threshold)
    {
        // here is one note onset point
    }
}

In practice this works horribly, since normal audio has all sorts of transient spikes above a given threshold. One solution is to use a running average signal strength (i.e. don't mark a start until the average of the last n samples is above the threshold).

short threshold = 10000;
int window_length = 100;
int running_total = 0;
// tally up the absolute values of the first window_length samples
// (signed samples swing around zero, so their plain sum stays near zero)
for (int i = 0; i < window_length; i++)
{
    running_total += Math.Abs((int)samples[i]);
}
// calculate moving average
for (int i = window_length; i < samples.Length; i++)
{
    // remove oldest sample and add current
    running_total -= Math.Abs((int)samples[i - window_length]);
    running_total += Math.Abs((int)samples[i]);
    int moving_average = running_total / window_length;
    if (moving_average > threshold)
    {
        // here is one note onset point (centre of the averaging window)
        int onset_point = i - (window_length / 2);
    }
}

All of this requires much tweaking and playing around with settings to get it to find the start positions of a WAV file accurately, and usually what works for one file will not work very well on another. This is a very difficult and not-perfectly-solved problem domain you've chosen, but I think it's cool that you're tackling it.

Update: this graphic shows a detail of note detection I left out, namely detecting when the note ends:

[Image: the waveform with the off-threshold in yellow and the detected note ends in purple]

The yellow line represents the off-threshold. Once the algorithm has detected a note start, it assumes the note continues until the running average signal strength drops below this value (shown here by the purple lines). This is, of course, another source of difficulties, as is the case where two or more notes overlap (polyphony).
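A minimal sketch of the on-threshold/off-threshold idea, written in Java and assuming 16-bit mono samples. The class name `NoteSegmenter`, the `{start, end}` return shape, and placing each boundary at the centre of the averaging window are illustrative choices, not part of the original answer. A note opens when the running average of the absolute sample values rises above `onThreshold` and closes when it falls below `offThreshold`.

```java
import java.util.ArrayList;
import java.util.List;

final class NoteSegmenter {
    // Returns {start, end} sample-index pairs, one per detected note.
    static List<int[]> segment(short[] samples, int window, int onThreshold, int offThreshold) {
        List<int[]> notes = new ArrayList<>();
        long total = 0;          // sum of |sample| over the last `window` samples
        boolean inNote = false;
        int noteStart = 0;
        for (int i = 0; i < samples.length; i++) {
            total += Math.abs((int) samples[i]);
            if (i >= window) {
                total -= Math.abs((int) samples[i - window]);
            }
            if (i < window - 1) {
                continue;        // averaging window not full yet
            }
            long average = total / window;
            int centre = i - window / 2;
            if (!inNote && average > onThreshold) {
                inNote = true;
                noteStart = centre;
            } else if (inNote && average < offThreshold) {
                inNote = false;
                notes.add(new int[] {noteStart, centre});
            }
        }
        if (inNote) {
            // the recording ended while a note was still sounding
            notes.add(new int[] {noteStart, samples.length});
        }
        return notes;
    }
}
```

Keeping the off-threshold well below the on-threshold stops a decaying note from being switched on and off repeatedly as its level wobbles around a single value. The price is that the reported end lags the real end, because the average needs time to drain.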

Once you've detected the start and stop points of each note, you can now analyze each slice of WAV file data to determine the pitches.

Update 2: I just read your updated question. Pitch-detection through auto-correlation is much easier to implement than FFT if you're writing your own from scratch, but if you've already checked out and used a pre-built FFT library, you're better off using it for sure. Once you've identified the start and stop positions of each note (and included some padding at the beginning and end for the missed attack and release portions), you can now pull out each slice of audio data and pass it to an FFT function to determine the pitch.

One important point here is not to use a slice of the compressed audio data, but rather to use a slice of the original, unmodified data. The compression process distorts the audio and may produce an inaccurate pitch reading.
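If you do go the auto-correlation route, a bare-bones version might look like the Java sketch below. Everything here (the class name, the `minHz`/`maxHz` search range, normalising by the number of overlapping samples) is an assumption for illustration. It takes a slice of the original, uncompressed samples and returns the frequency whose lag gives the strongest self-similarity.

```java
final class AutocorrelationPitch {
    // Estimates the fundamental frequency in Hz of samples[start .. end),
    // searching only lags that correspond to the range [minHz, maxHz].
    // Returns NaN if no lag in that range has a positive correlation.
    static double estimateHz(short[] samples, int start, int end,
                             int sampleRate, double minHz, double maxHz) {
        int n = end - start;
        int minLag = Math.max(1, (int) Math.floor(sampleRate / maxHz));
        int maxLag = Math.min(n - 1, (int) Math.ceil(sampleRate / minHz));
        int bestLag = -1;
        double best = 0.0;
        for (int lag = minLag; lag <= maxLag; lag++) {
            double sum = 0.0;
            for (int i = start; i + lag < end; i++) {
                sum += (double) samples[i] * samples[i + lag];
            }
            // average over the overlapping samples so long lags are not penalised
            double score = sum / (n - lag);
            if (score > best) {
                best = score;
                bestLag = lag;
            }
        }
        return bestLag > 0 ? (double) sampleRate / bestLag : Double.NaN;
    }
}
```

The search range matters: a periodic signal correlates just as well at two or three periods as at one, so a range wider than an octave invites octave errors. The double loop is also slow on long slices, which is one reason an existing FFT library is the more practical choice if you already have one working.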

One last point about note attack times is that they may be less of a problem than you think. Often in music an instrument with a slow attack (like a soft synth) will begin a note earlier than a sharp-attack instrument (like a piano), and both notes will sound as if they're starting at the same time. If you're playing instruments in this manner, the algorithm will pick up the same start time for both kinds of instruments, which is good from a WAV-to-MIDI perspective.

Last update (I hope): Forget what I said about including some padding samples from the early attack part of each note - I forgot this is actually a bad idea for pitch detection. The attack portions of many instruments (especially piano and other percussive-type instruments) contain transients that aren't multiples of the fundamental pitch, and will tend to screw up pitch detection. You actually want to start each slice a little after the attack for this reason.

Oh, and kind of important: the term "compression" here does not refer to MP3-style compression.

Update again: here is a simple function, in C#-like pseudocode, that does non-dynamic compression:

public void StaticCompress(short[] samples, float param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        int sign = (samples[i] < 0) ? -1 : 1;
        float norm = ABS(samples[i] / 32768.0); // NOT short.MaxValue; 32768.0 avoids integer division
        norm = 1.0 - POW(1.0 - norm, param);
        samples[i] = 32768 * norm * sign;
    }
}

When param = 1.0, this function will have no effect on the audio. Larger param values (2.0 is good, which will square the normalized difference between each sample and the max peak value) will produce more compression and a louder overall (but crappy) sound. Values under 1.0 will produce an expansion effect.
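To make the effect of param concrete, here is the curve with a few numbers plugged in. The symbols x (the normalized absolute sample value, i.e. |sample| / 32768), y (the compressed value), and p (param) are introduced here only for illustration.

```latex
y = 1 - (1 - x)^{p}

p = 1:\quad   y = x \quad\text{(no change)}
p = 2:\quad   x = 0.25 \mapsto 1 - 0.75^{2} = 0.4375, \qquad x = 0.5 \mapsto 1 - 0.5^{2} = 0.75
p = 0.5:\quad x = 0.75 \mapsto 1 - \sqrt{0.25} = 0.5
```

So with p = 2 a half-scale sample ends up at three-quarter scale, while samples already near full scale barely move. That is the "flatten the peaks, raise everything else" behaviour described above.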

One other probably obvious point: you should record the music in a small, non-echoic room since echoes are often picked up by this algorithm as phantom notes.

Update: here is a version of StaticCompress that will compile in C# and explicitly casts everything. This returns the expected result:

public void StaticCompress(short[] samples, double param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        Compress(ref samples[i], param);
    }
}

public void Compress(ref short orig, double param)
{
    double sign = 1;
    if (orig < 0)
    {
        sign = -1;
    }
    // 32768 is max abs value of a short. best practice is to pre-
    // normalize data or use peak value in place of 32768
    double norm = Math.Abs((double)orig / 32768.0);
    norm = 1.0 - Math.Pow(1.0 - norm, param);
    orig = (short)(32768.0 * norm * sign); // should round before cast,
        // but won't affect note onset detection
}

Sorry, my knowledge score on Matlab is 0. If you posted another question on why your Matlab function doesn't work as expected it would get answered (just not by me).

在巴黎塔顶看东京樱花 2024-07-15 07:30:53

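What you want to do is usually called WAV-to-MIDI (google "wav-to-midi"). There have been many attempts at this, with varying results (note onset is one of the difficulties; polyphony is much harder to deal with). I'd suggest starting with a thorough search of the off-the-shelf solutions, and only starting on your own if nothing acceptable is out there.

The other part of the process you'd need is something to render the MIDI output as a traditional musical score, but there are countless products that do that.

Another answer is: yes, I've done a lot of digital signal processing (see the software on my website - it's an infinite-voice software synthesizer written in VB and C), and I'm interested in helping you with this problem. The WAV-to-MIDI part isn't really that difficult conceptually; it's making it work reliably in practice that's hard. Note onset is just a matter of setting a threshold - errors can easily be adjusted forward or backward in time to compensate for differences in note attack. Pitch detection is much easier to do on a recording than in real time, and involves just implementing an auto-correlation routine.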
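Once a slice's pitch has been estimated, the WAV-to-MIDI step still has to turn a frequency into a MIDI note number. A small Java sketch of that mapping follows. It assumes equal temperament with A4 = 440 Hz = MIDI note 69 and the common "middle C is C4" octave naming; the class and method names are made up for this example.

```java
final class MidiPitch {
    private static final String[] NAMES = {
        "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"
    };

    // Nearest MIDI note number for a frequency in Hz
    // (assumes A4 = 440 Hz = note 69, twelve equal semitones per octave).
    static int noteNumber(double hz) {
        double semitonesFromA4 = 12.0 * (Math.log(hz / 440.0) / Math.log(2.0));
        return (int) Math.round(69.0 + semitonesFromA4);
    }

    // Human-readable name for a non-negative MIDI note number, e.g. 60 -> "C4".
    static String noteName(int note) {
        return NAMES[note % 12] + (note / 12 - 1);
    }
}
```

Rounding to the nearest semitone also absorbs small pitch-detection errors, since anything within half a semitone of the true pitch lands on the right note.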