I suggest you extend the Common Voice (CV) Danish subset with your own dataset. Analyse the CV corpus first and shape your data to match it. The important points here are the file extension (.wav, .mp3, ...), the sample type (float32, int, ...), the audio lengths, and of course the transcription format. Do not make your corpus sparse.
Place your data in the CV corpus folder and load the dataset. You should then be able to fine-tune the model on the extended data using the existing code.
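As a concrete example of matching the sample type: wav2vec2 models expect 16 kHz mono float32 input. Below is a minimal sketch that downmixes and resamples with plain NumPy linear interpolation; `to_model_format` and `resample_linear` are hypothetical helper names, and in practice a proper resampler (e.g. from torchaudio or librosa) is preferable for audio quality.

```python
import numpy as np

def resample_linear(samples: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample a mono signal via linear interpolation (sketch, not production-grade)."""
    duration = len(samples) / src_rate
    n_out = int(round(duration * dst_rate))
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    t_in = np.arange(len(samples)) / src_rate
    return np.interp(t_out, t_in, samples).astype(np.float32)

def to_model_format(samples, rate: int, target_rate: int = 16_000) -> np.ndarray:
    """Convert raw audio samples to 16 kHz mono float32."""
    samples = np.asarray(samples)
    if samples.ndim == 2:
        # Downmix stereo (n_samples, 2) to mono by averaging channels.
        samples = samples.mean(axis=1)
    if np.issubdtype(samples.dtype, np.integer):
        # Scale integer PCM into [-1, 1] before converting to float32.
        samples = samples / np.iinfo(samples.dtype).max
    samples = samples.astype(np.float32)
    if rate != target_rate:
        samples = resample_linear(samples, rate, target_rate)
    return samples
```

For instance, one second of 44.1 kHz int16 audio comes out as 16000 float32 samples.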
Do not create a completely new corpus unless you are a wav2vec expert.
A note: you should be able to get reasonable results with less data. What WER did you achieve, and what is your target? Hyper-parameter tuning may be the first thing to look at, rather than more data.
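Since the note above asks about WER, it is worth measuring it before collecting more data. Here is a minimal sketch of word error rate as word-level Levenshtein distance divided by reference length; libraries such as jiwer provide the same metric ready-made.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, wer("a b c d", "a x c") is 0.5 (one substitution plus one deletion over four reference words).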
To fine-tune the XLSR model using a custom dataset, you'll need to do something like this:
from huggingsound import TrainingArguments, ModelArguments, SpeechRecognitionModel, TokenSet
model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53")
output_dir = "my/finetuned/model/output/dir"
# first of all, you need to define your model's token set
# however, the token set is only needed for non-finetuned models
# if you pass a new token set for an already finetuned model, it'll be ignored during training
tokens = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
token_set = TokenSet(tokens)
# define your custom train data
train_data = [
    {"path": "/path/to/sagan.mp3", "transcription": "extraordinary claims require extraordinary evidence"},
    {"path": "/path/to/asimov.wav", "transcription": "violence is the last refuge of the incompetent"},
]
# and finally, fine-tune your model
model.finetune(
    output_dir,
    train_data=train_data,
    token_set=token_set,
)
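One caveat for Danish, assuming you reuse the example above as-is: the a-z token list omits æ, ø, and å, so training could never emit those characters. A safer approach is to derive the character inventory from your own normalized, lowercased transcriptions; `build_token_set` below is a hypothetical helper sketching that idea.

```python
def build_token_set(transcriptions):
    """Collect every character used in the transcriptions, except the space
    (the word delimiter is handled separately by the CTC tokenizer)."""
    chars = set()
    for text in transcriptions:
        chars.update(text.lower())
    chars.discard(" ")
    return sorted(chars)

tokens = build_token_set(["hvad hedder du", "jeg hedder søren"])
# tokens now includes "ø" alongside the a-z characters actually used
```

The resulting list can then be passed to TokenSet(tokens) in place of the hard-coded a-z list.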
I've built a tool to help me fine-tune wav2vec2 models using custom data; it is the huggingsound library used in the code above. Maybe it can help you too: https://github.com/jonatasgrosman/huggingsound
You can install it using:
pip install huggingsound