I suggest you extend the Common Voice (CV) Danish subset with your own dataset. Analyse the CV corpus first and shape your data to match it. The important points here are the file extension (.wav, .mp3, ...), the sample type (float32, int, ...), the audio lengths, and of course the transcription format. Do not make your corpus sparse.
Place your data in the CV corpus folder and load the dataset. You should then be able to fine-tune the model on the extended data using the existing code.
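As a concrete example of matching the sample type: wav2vec2 models expect 16 kHz mono float32 input. Below is a minimal sketch that downmixes and resamples with plain NumPy linear interpolation; `to_model_format` and `resample_linear` are hypothetical helper names, and in practice a proper resampler (e.g. from torchaudio or librosa) is preferable for audio quality.

```python
import numpy as np

def resample_linear(samples: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample a mono signal via linear interpolation (sketch, not production-grade)."""
    duration = len(samples) / src_rate
    n_out = int(round(duration * dst_rate))
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    t_in = np.arange(len(samples)) / src_rate
    return np.interp(t_out, t_in, samples).astype(np.float32)

def to_model_format(samples, rate: int, target_rate: int = 16_000) -> np.ndarray:
    """Convert raw audio samples to 16 kHz mono float32."""
    samples = np.asarray(samples)
    if samples.ndim == 2:
        # Downmix stereo (n_samples, 2) to mono by averaging channels.
        samples = samples.mean(axis=1)
    if np.issubdtype(samples.dtype, np.integer):
        # Scale integer PCM into [-1, 1] before converting to float32.
        samples = samples / np.iinfo(samples.dtype).max
    samples = samples.astype(np.float32)
    if rate != target_rate:
        samples = resample_linear(samples, rate, target_rate)
    return samples
```

For instance, one second of 44.1 kHz int16 audio comes out as 16000 float32 samples.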
Do not create a completely new corpus unless you are a wav2vec expert.
A note: you should be able to get reasonable results with less data. What WER did you achieve, and what is your target? Hyper-parameter tuning may be the first thing to look at, rather than more data.
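Since the note above asks about WER, it is worth measuring it before collecting more data. Here is a minimal sketch of word error rate as word-level Levenshtein distance divided by reference length; libraries such as jiwer provide the same metric ready-made.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, wer("a b c d", "a x c") is 0.5 (one substitution plus one deletion over four reference words).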
To fine-tune the XLSR model using a custom dataset, you'll need to do something like this:
from huggingsound import TrainingArguments, ModelArguments, SpeechRecognitionModel, TokenSet
model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53")
output_dir = "my/finetuned/model/output/dir"
# first of all, you need to define your model's token set
# however, the token set is only needed for non-finetuned models
# if you pass a new token set for an already finetuned model, it'll be ignored during training
tokens = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
token_set = TokenSet(tokens)
# define your custom train data
train_data = [
    {"path": "/path/to/sagan.mp3", "transcription": "extraordinary claims require extraordinary evidence"},
    {"path": "/path/to/asimov.wav", "transcription": "violence is the last refuge of the incompetent"},
]
# and finally, fine-tune your model
model.finetune(
    output_dir,
    train_data=train_data,
    token_set=token_set,
)
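One caveat for Danish, assuming you reuse the example above as-is: the a-z token list omits æ, ø, and å, so training could never emit those characters. A safer approach is to derive the character inventory from your own normalized, lowercased transcriptions; `build_token_set` below is a hypothetical helper sketching that idea.

```python
def build_token_set(transcriptions):
    """Collect every character used in the transcriptions, except the space
    (the word delimiter is handled separately by the CTC tokenizer)."""
    chars = set()
    for text in transcriptions:
        chars.update(text.lower())
    chars.discard(" ")
    return sorted(chars)

tokens = build_token_set(["hvad hedder du", "jeg hedder søren"])
# tokens now includes "ø" alongside the a-z characters actually used
```

The resulting list can then be passed to TokenSet(tokens) in place of the hard-coded a-z list.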
I've built a tool to help me fine-tune wav2vec2 models using custom data; it is the huggingsound library used in the code above. Maybe it can help you too: https://github.com/jonatasgrosman/huggingsound
You can install it using:
pip install huggingsound