The reason it already works is probably simply that you didn't train it to pick one and only one output (at least I assume you didn't). In the simple case where the output is just a dot product of the input and the weights, the weights become matched filters for the corresponding pitches. Since everything is linear, multiple outputs are activated simultaneously whenever multiple matched filters see good matches at the same time (as is the case for polyphonic notes). Since your network probably includes nonlinearities, the picture is a bit more complex, but the idea is probably the same.
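To make the matched-filter intuition concrete, here is a minimal numpy sketch (not the poster's actual network): the "weights" are unit-energy sinusoid templates for three hypothetical candidate pitches, and a purely linear layer is just a dot product, so two simultaneous notes light up two outputs at once.

```python
import numpy as np

sr = 8000          # sample rate in Hz (arbitrary choice)
n = 1024           # analysis window length
t = np.arange(n) / sr

# One matched filter (weight vector) per candidate pitch:
# a unit-energy sinusoid at that frequency.
pitches = [220.0, 330.0, 440.0]
weights = np.stack([np.sin(2 * np.pi * f * t) for f in pitches])
weights /= np.linalg.norm(weights, axis=1, keepdims=True)

# A polyphonic input: two of the three pitches sounding at once.
x = np.sin(2 * np.pi * 220.0 * t) + np.sin(2 * np.pi * 440.0 * t)

# Linear "network": each output is just a dot product with its weights.
activations = weights @ x

# The 220 Hz and 440 Hz filters fire strongly; 330 Hz stays near zero.
active = activations > 0.5 * activations.max()
print(list(active))
```

With nonlinearities after the dot products the picture blurs, but the same template-matching behavior tends to survive, which is presumably why the poster's network already responds to multiple notes.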
Regarding ways to improve it, training with polyphonic samples is certainly one possibility. Another is to switch to a linear filter. The DFT of a polyphonic sound is essentially the sum of the DFTs of the individual sounds. You want a linear combination of inputs to map to the corresponding linear combination of outputs, so a linear filter is appropriate.
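The linearity property this argument rests on is easy to verify numerically: the spectrum of a mix equals the sum of the individual spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(512)   # one "voice"
b = rng.standard_normal(512)   # another "voice"

# The DFT is linear: the spectrum of a sum is the sum of the spectra.
lhs = np.fft.rfft(a + b)
rhs = np.fft.rfft(a) + np.fft.rfft(b)
print(np.allclose(lhs, rhs))  # True
```

Note this holds exactly only for the complex spectrum; magnitudes do not add linearly, which is one reason polyphonic pitch detection stays hard even with linear front ends.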
Incidentally, why use a neural network for this in the first place? It seems that just looking at the DFT and, say, taking the frequency bin with the maximum magnitude would give you better results with less effort.
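That baseline is a few lines, sketched here for a clean synthetic tone. It is only reliable for monophonic input, and for real instruments the strongest bin can be a harmonic rather than the fundamental, so treat it as a sanity check rather than a full detector.

```python
import numpy as np

sr = 8000
t = np.arange(4096) / sr
x = np.sin(2 * np.pi * 440.0 * t)  # a single 440 Hz tone

# Pick the frequency of the strongest DFT bin.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
pitch = freqs[np.argmax(spectrum)]
print(pitch)  # within one bin (~2 Hz) of 440 Hz
```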
Anssi Klapuri is a well-respected audio researcher who has published a method to perform pitch detection upon polyphonic recordings using Neural Networks.
You might want to compare Klapuri's method to yours. It is fully described in his thesis, Signal Processing Methods for the Automatic Transcription of Music. You can find many of his papers online, or buy his book, which explains his algorithm and test results. His thesis is linked below.

https://www.cs.tut.fi/sgn/arg/klap/phd/klap_phd.pdf
Pitch detection on polyphonic recordings is a very difficult topic and contains many controversies -- be prepared to do a lot of reading. The link below describes another approach to pitch detection on polyphonic recordings, which I developed for a free app called PitchScope Player. My C++ source code is available on GitHub.com and is referenced within the link below. A free executable version of PitchScope Player is also available on the web and runs on Windows.

Real time pitch detection
I experimented with evolving a CTRNN (Continuous Time Recurrent Neural Network) to detect the difference between two sine waves. I had moderate success, but never had time to follow up with a bank of these neurons (i.e., in bands similar to the cochlea).
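For readers unfamiliar with CTRNNs, one common formulation is tau * dy/dt = -y + W @ sigmoid(y + theta) + input, Euler-integrated in discrete steps. The sketch below uses arbitrary placeholder parameters; in the experiment described above those parameters would be found by evolution rather than set by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ctrnn_step(y, w, tau, theta, inp, dt=0.01):
    """One Euler step of tau * dy/dt = -y + W @ sigmoid(y + theta) + inp."""
    return y + (dt / tau) * (-y + w @ sigmoid(y + theta) + inp)

# Tiny 2-neuron network driven by a sine input on neuron 0.
rng = np.random.default_rng(1)
y = np.zeros(2)
w = rng.standard_normal((2, 2))  # placeholder weights (would be evolved)
tau = np.array([0.1, 0.2])       # per-neuron time constants
theta = np.zeros(2)              # biases

for k in range(1000):
    inp = np.array([np.sin(2 * np.pi * 5.0 * k * 0.01), 0.0])
    y = ctrnn_step(y, w, tau, theta, inp)

print(y)  # the leaky -y term keeps the state bounded
```

A bank of such networks, each tuned to a narrow frequency band, would loosely mimic the cochlea's filter bank.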
One possible approach would be to employ Genetic Programming (GP) to generate short snippets of code that detect the pitch. This way you would be able to generate a human-readable rule for how the pitch detection works.