An algorithm for removing vocals from an audio track

Posted on 2024-09-18 07:54:38

Comments (4)

余生再见 2024-09-25 07:54:38

This is less an "algorithm" than a "trick", but it can be automated in code. It works mainly for stereo tracks in which the vocals are centered. Centered vocals appear identically in both channels, so if you invert one channel and then mix the two back together, the waveforms of the center vocals cancel out and are virtually removed. You can do this manually in most good audio editors, such as Audacity. The results aren't perfect, and the rest of the audio suffers a bit too, but it makes for great karaoke tracks :)
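As a rough illustration of the invert-and-mix trick, here is a minimal Python sketch on synthetic signals. The function name `remove_center`, the 8000 Hz rate, and the two sine tones standing in for a vocal and a guitar are all assumptions made for the example, and the result is halved so the mix cannot exceed the original range.

```python
import math

def remove_center(left, right):
    """Invert the right channel and mix it with the left, halving the sum."""
    return [(l - r) / 2 for l, r in zip(left, right)]

# Synthetic one-second signals at a hypothetical 8000 Hz sampling rate.
rate = 8000
vocal = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]   # centered
guitar = [math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]  # left only

left = [v + g for v, g in zip(vocal, guitar)]  # vocal plus guitar
right = list(vocal)                            # vocal only

karaoke = remove_center(left, right)  # the vocal cancels; half the guitar remains
```

Anything that is identical in both channels disappears, which is also why centered instruments are lost along with the vocal.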

李不 2024-09-25 07:54:38

Source: http://www.cdf.utoronto.ca/~csc209h/summer/a2/a2.html, written by Daniel Zingaro.

Sounds are waves of air pressure. When a sound is generated, a sound wave consisting of compressions (increases in pressure) and rarefactions (decreases in pressure) moves through the air. This is similar to what happens if you throw a stone into a pond: the water rises and falls in a repeating wave.

When a microphone records sound, it takes a measure of the air pressure and returns it as a value. These values are called samples and can be positive or negative, corresponding to increases or decreases in air pressure. Each time the air pressure is recorded, we are sampling the sound. Each sample records the sound at an instant in time; the faster we sample, the more accurate our representation of the sound. The sampling rate refers to how many times per second we sample the sound. For example, CD-quality sound uses a sampling rate of 44100 samples per second; sampling someone's voice for use in a VoIP conversation uses far less than this. Sampling rates of 11025 (voice quality), 22050, and 44100 (CD quality) are common...
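To make the sampling rate concrete, the following sketch samples a sine tone at two of the rates mentioned above. The helper `sample_tone`, the 440 Hz tone, and the signed 16-bit amplitude are assumptions chosen for illustration, not part of the quoted assignment.

```python
import math

def sample_tone(freq_hz, rate_hz, seconds, amplitude=32767):
    """Return signed 16-bit samples of a sine tone, one per sampling instant."""
    count = int(rate_hz * seconds)
    return [round(amplitude * math.sin(2 * math.pi * freq_hz * n / rate_hz))
            for n in range(count)]

# Half a second of the same tone at voice quality and at CD quality.
voice = sample_tone(440, 11025, 0.5)
cd = sample_tone(440, 44100, 0.5)
```

The CD-quality list holds about four times as many values for the same half second, which is what a higher sampling rate means in practice.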

For mono sounds (those with one sound channel), a sample is simply a positive or negative integer that represents the amount of compression in the air at the point the sample was taken. For stereo sounds (which we use in this assignment), a sample is actually made up of two integer values: one for the left speaker and one for the right...
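A small sketch of what "two integer values per sample" can look like in memory. It assumes interleaved, little-endian, signed 16-bit values in left-then-right order, which is the usual layout for 16-bit PCM WAV data; the function name `split_stereo` is hypothetical.

```python
import struct

def split_stereo(frames):
    """Split interleaved little-endian 16-bit stereo data into two lists."""
    left, right = [], []
    for l, r in struct.iter_unpack("<hh", frames):
        left.append(l)
        right.append(r)
    return left, right

# Three stereo samples packed as L, R, L, R, L, R.
frames = struct.pack("<6h", 100, -100, 200, -200, 300, -300)
left, right = split_stereo(frames)
```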

Here's how the algorithm [to remove vocals] works.

  • Copy the first 44 bytes verbatim from the input file to the output file. Those 44 bytes contain important header information that should not be modified.

  • Next, treat the rest of the input file as a sequence of shorts. Take each pair of shorts, left and right, and compute combined = (left - right) / 2. Write two copies of combined to the output file.
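The two steps above can be sketched in Python roughly as follows. This is not the assignment's reference solution: it assumes a canonical 44-byte header followed by little-endian 16-bit stereo data, and the name `remove_vocals` and the choice to truncate toward zero (as integer division does in C) are assumptions of this sketch.

```python
import struct

HEADER_SIZE = 44  # assumes a canonical PCM WAV header, as the assignment does

def remove_vocals(wav_bytes):
    """Copy the header, then replace each L/R pair with two copies of (L - R) / 2."""
    header = wav_bytes[:HEADER_SIZE]
    body = wav_bytes[HEADER_SIZE:]
    usable = len(body) - len(body) % 4  # ignore a trailing partial pair
    out = bytearray(header)
    for left, right in struct.iter_unpack("<hh", body[:usable]):
        combined = int((left - right) / 2)  # truncates toward zero
        out += struct.pack("<hh", combined, combined)
    return bytes(out)

# Tiny in-memory example: a dummy header followed by three stereo pairs.
header = bytes(range(HEADER_SIZE))
body = struct.pack("<6h", 1000, 200, -300, 300, 3, 0)
result = remove_vocals(header + body)
```

To run it on a real file, read the bytes with `open(path, "rb")` and write the returned bytes to a new file. Besides keeping the volume down, halving the difference keeps every result within the 16-bit range, so packing cannot overflow.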

Why Does This Work?

For the curious, a brief explanation of the vocal-removal algorithm is in order. As you noticed from the algorithm, we are simply subtracting one channel from the other (and then dividing by 2 to keep the volume from getting too loud). So why does subtracting the right channel from the left channel magically remove vocals?

When music is recorded, it is sometimes the case that vocals are recorded by a single microphone, and that single vocal track is used for the vocals in both channels. The other instruments in the song are recorded by multiple microphones, so that they sound different in each channel. Subtracting one channel from the other takes away everything that is "in common" between those two channels, which, if we're lucky, means removing the vocals.
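The argument can be written out as a short calculation. The symbols are introduced here for illustration only: $V$ is the shared vocal signal, and $I_L$ and $I_R$ are whatever else each channel contains.

```latex
\begin{aligned}
L &= V + I_L, \qquad R = V + I_R \\
\frac{L - R}{2} &= \frac{(V + I_L) - (V + I_R)}{2} = \frac{I_L - I_R}{2}
\end{aligned}
```

The vocal term drops out only when it is exactly the same in both channels, and any instrument content shared by $I_L$ and $I_R$ drops out with it.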

Of course, things rarely work so well. Try your vocal remover on this badly-behaved WAV file. Sure, the vocals are gone, but so is the body of the music! Apparently, some of the instruments were also recorded "centred", so that they are removed along with the vocals when channels are subtracted.

地狱即天堂 2024-09-25 07:54:38

You can use the pydub library. It depends on FFmpeg and can read any file format that FFmpeg supports.

Then you can do the following:

from pydub import AudioSegment

# Hypothetical input and output paths; replace with your own files
myAudioFile = "input.mp3"
myAudioFile_CentersOut = "output_centers_out.mp3"

# Read in the audio file and split it into its two mono tracks
sound_stereo = AudioSegment.from_file(myAudioFile, format="mp3")
sound_monoL, sound_monoR = sound_stereo.split_to_mono()

# Invert the phase of the right channel
sound_monoR_inv = sound_monoR.invert_phase()

# Overlay L and inverted R; content common to both channels cancels out
sound_CentersOut = sound_monoL.overlay(sound_monoR_inv)

# Export the merged audio file
sound_CentersOut.export(myAudioFile_CentersOut, format="mp3")

许一世地老天荒 2024-09-25 07:54:38

Above a specified limit? That sounds like a high-pass filter... You could use phase cancellation if you had the a cappella track along with the original. Otherwise, unless it's an old 60s-era track with the vocals directly in the middle and everything else hard-panned, I don't think there's a super clean way of removing vocals.
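For the a cappella case, phase cancellation amounts to subtracting the isolated vocal from the full mix, sample by sample. The sketch below assumes both tracks are already decoded into equal-length lists of samples that are aligned in time and matched in level; the name `subtract_acapella` and the toy values are hypothetical.

```python
def subtract_acapella(mix, acapella):
    """Subtract an isolated vocal take from the full mix, sample by sample."""
    if len(mix) != len(acapella):
        raise ValueError("tracks must be the same length and sample-aligned")
    return [m - a for m, a in zip(mix, acapella)]

# Toy data: the mix is the instrumental plus the vocal.
instrumental = [10, -20, 30, -40]
vocal = [5, 5, -5, -5]
mix = [i + v for i, v in zip(instrumental, vocal)]

recovered = subtract_acapella(mix, vocal)
```

With real recordings the cancellation is only as good as the match between the two vocal takes, so a different mix, a timing offset, or lossy compression will likely leave residue.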
