Adding more custom entities to a pretrained custom NER model in spaCy 3

Posted on 2025-01-31 17:22:11

I have a huge amount of text data and want to add around 50 different entities. When I first started working with it, I was getting a memory error. As far as I know, spaCy needs roughly 1 GB of memory per 100,000 characters and accepts at most 1,000,000 characters per text by default. So I split my dataset into 5 chunks and used an annotator to create a separate JSON file for each. I started with one JSON file and successfully trained a model from it. Now I want to add more data to it so that I don't miss any tags and the model is trained on a good variety of data. Please guide me on how to proceed.

Comments (1)

遮了一弯 2025-02-07 17:22:11

I mentioned some points of confusion in a comment, but assuming that your issue is how to load a large training set into spaCy, the solution is pretty simple.

First, save your training data as multiple .spacy files in one directory. You do not have to make JSON files; that was the standard format in v2. For details on training data, see the training data section of the docs. In your config you can specify this directory as the training data source, and spaCy will use all the files there.
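As a rough illustration of that first step, here is a minimal sketch that writes annotations to several .spacy files with `DocBin`. The input format (a text plus a list of `(start, end, label)` character offsets), the helper names `write_shards` and `_flush`, the shard size, and the `train_000.spacy` file naming are all assumptions made for the example; adapt them to whatever your annotation tool exports.

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def _flush(doc_bin, out_dir, index):
    """Save one DocBin as train_<index>.spacy and return its path."""
    path = out_dir / f"train_{index:03d}.spacy"
    doc_bin.to_disk(path)
    return path


def write_shards(examples, out_dir, shard_size=1000, lang="en"):
    """Write (text, [(start, end, label), ...]) pairs as several .spacy files.

    The input format is an assumption for this sketch, not a spaCy format.
    """
    nlp = spacy.blank(lang)  # tokenizer only; no trained components needed
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    doc_bin = DocBin()
    for text, spans in examples:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label)
            # char_span returns None when offsets do not match token boundaries
            if span is not None:
                ents.append(span)
        doc.ents = ents  # raises if the spans overlap
        doc_bin.add(doc)
        if len(doc_bin) >= shard_size:
            paths.append(_flush(doc_bin, out_dir, len(paths)))
            doc_bin = DocBin()
    if len(doc_bin) > 0:
        paths.append(_flush(doc_bin, out_dir, len(paths)))
    return paths
```

If your config uses the usual `paths.train` variable, pointing it at the output directory (for example with `--paths.train ./corpus/train` on the `spacy train` command line) should make spaCy read every .spacy file in it.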

Next, to avoid keeping all the training data in memory, you can specify max_epochs = -1 (see the docs on streaming corpora). Using this feature means you will have to specify your labels ahead of time as covered in the docs there. You will probably also want to shuffle your training data manually.
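Because a streamed corpus is never loaded as a whole, spaCy cannot scan it to discover the labels or shuffle it for you. The sketch below shows one way to do both by hand before converting the data, using only the standard library. The record format (text plus `(start, end, label)` tuples) and the function names are hypothetical, and writing the labels as a JSON list of strings is an assumption about what the labels file should look like; check it against the output of `spacy init labels` and the streaming corpora docs.

```python
import json
import random
from pathlib import Path


def collect_labels(records):
    """Return the sorted set of entity labels used in the annotations."""
    return sorted({label for _text, spans in records for _s, _e, label in spans})


def shuffle_into_shards(records, n_shards, seed=0):
    """Shuffle once, then deal the records round-robin into n_shards lists."""
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed keeps runs reproducible
    return [records[i::n_shards] for i in range(n_shards)]


def write_labels(labels, path):
    """Write the labels as a JSON list (assumed format, verify against the docs)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(labels, indent=2), encoding="utf8")
    return path
```

Shuffling across all five annotation files before sharding matters here: if each file covers a different subset of the 50 entity types, training on them in their original order would show the model long runs of examples with only some of the labels.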

That's all you need to train with a lot of data.

The title of your question mentions adding entities to the pretrained model. It's usually better to train from scratch instead to avoid catastrophic forgetting, but you can see a guide to doing it here.
