SpaCy 3 -- ValueError: [E973] Unexpected type for NER data

Posted 2025-01-10 04:00:05

I've been stressing over this problem for a long time and can't seem to find a solution. I want to train an NER model to recognise animal and species names, and I created a mock training set to test it out. However, I keep getting ValueError: [E973] Unexpected type for NER data.

I have already tried the solutions suggested in other StackOverflow posts, including:

  • Double-checking that the format and types of my training set are correct
  • Using spacy.load('en_core_web_sm') instead of spacy.blank('en')
  • Installing spacy-lookups-data

All of these result in the same error.

import os
import spacy
from spacy.lang.en import English
from spacy.training.example import Example
import random


def train_spacy(data, iterations = 30):
    TRAIN_DATA = data

    nlp = spacy.blank("en") #start with a blank model

    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last = True)

    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print ("Starting iterations "+str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}

            for text, annotations in TRAIN_DATA:
                doc = nlp.make_doc(text)

                print(isinstance(annotations["entities"], (list,tuple))) #this prints True

                example = Example.from_dict(doc, {"entities":annotations})
                nlp.update(
                    [example],
                    drop = 0.2,
                    sgd = optimizer,
                    losses = losses
                )
        print(losses)
    return (nlp)

if __name__ == "__main__":
    #mock training set
    TRAIN_DATA=[('Dog is an animal',{'entities':[(0,3,'ANIMAL')]}),
           ('Cat is on the table',{'entities':[(0,3,'ANIMAL')]}),
           ('Rats are pets',{'entities':[(0,4,'ANIMAL')]})]
    nlp = train_spacy(TRAIN_DATA)

The error message:

  File "c:\...\summarizer\src\feature_extraction\feature_extraction.py", line 49, in <module>
    nlp = train_spacy(TRAIN_DATA)
  File "c:\...\summarizer\src\feature_extraction\feature_extraction.py", line 35, in train_spacy
    example = Example.from_dict(doc, {"entities":annotations})
  File "spacy\training\example.pyx", line 118, in spacy.training.example.Example.from_dict
  File "spacy\training\example.pyx", line 24, in spacy.training.example.annotations_to_doc
  File "spacy\training\example.pyx", line 388, in spacy.training.example._add_entities_to_doc
ValueError: [E973] Unexpected type for NER data



Comments (1)

尾戒 2025-01-17 04:00:05

I ran into the same problem when I migrated some code of mine from a 2.x version of spaCy to 3.x, since several things changed.

Also, in your case it looks like you have a mix of spaCy 2.x and 3.x syntax. The following version of your code, with a few changes, works for me using spaCy 3.2.1:

import random

import spacy
from spacy.training import Example


def train_spacy(data, iterations=30):
    TRAIN_DATA = data

    # nlp = spacy.blank("en")  # start with a blank model
    nlp = spacy.load("en_core_web_lg")

    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    else:
        ner = nlp.get_pipe("ner")

    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

    # with nlp.disable_pipes(*other_pipes):
    losses = None

    optimizer = nlp.create_optimizer()
    for itn in range(iterations):
        print("Starting iterations " + str(itn))
        random.shuffle(TRAIN_DATA)
        losses = {}

        for text, annotations in TRAIN_DATA:
            doc = nlp.make_doc(text)

            print(isinstance(annotations["entities"], (list, tuple)))  # this prints True

            example = Example.from_dict(doc, annotations)
            losses = nlp.update(
                [example],
                drop=0.2,
                sgd=optimizer
            )

    print(losses)
    return nlp


if __name__ == "__main__":
    # mock training set
    TRAIN_DATA = [('Dog is an animal', {'entities': [(0, 3, 'ANIMAL')]}),
                  ('Cat is on the table', {'entities': [(0, 3, 'ANIMAL')]}),
                  ('Rats are pets', {'entities': [(0, 4, 'ANIMAL')]})]
    nlp = train_spacy(TRAIN_DATA)

Notice the following changes:

  1. I changed the import of the Example class to from spacy.training import Example. I think you were importing the wrong class.

  2. I'm using en_core_web_lg, but it should work with a blank model too.

  3. I commented out the disabling of the other pipeline components, because in spaCy 3.x the pipeline is more complex and I don't think you can disable the whole pipeline for the NER task. However, feel free to read the official documentation and try disabling the components you don't need.

  4. The optimizer is now initialised with nlp.create_optimizer() instead of nlp.begin_training().

  5. Note that annotations is already a dictionary in the expected format, so you don't need to wrap it in a new dictionary: Example.from_dict(doc, annotations) does the job (see the short sketch after this list).

  6. Finally, the losses are now returned as the result of the model update instead of being passed in as a parameter.
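To make point 5 concrete, here is a minimal standalone sketch of the difference, assuming only a blank English pipeline and one of your mock sentences:

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("Dog is an animal")
annotations = {"entities": [(0, 3, "ANIMAL")]}

# Wrapping the dict again nests "entities" one level too deep, so spaCy finds
# a dict where it expects a list of (start, end, label) tuples and raises E973:
# Example.from_dict(doc, {"entities": annotations})  # ValueError: [E973]

# Passing the annotations dict directly works:
example = Example.from_dict(doc, annotations)
print(example.reference.ents)  # (Dog,)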

I hope this helps, and please ask if you need more help.

Best regards!

EDIT:

I also want to suggest some changes to your training script to take better advantage of the spaCy utils:

  1. Use spacy.util.minibatch to create mini-batches from your training data (a short illustration of minibatch follows the full example below).

  2. Pass a whole mini-batch of examples to the update method instead of a mini-batch containing only one example.

Your code including this improvement, among other minor changes, would look as follows:

import random

import spacy
from spacy.training import Example


def train_spacy(data, iterations=30):
    TRAIN_DATA = data

    # nlp = spacy.blank("en")  # start with a blank model
    nlp = spacy.load("en_core_web_lg")

    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    else:
        ner = nlp.get_pipe("ner")

    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # Init loss
    losses = None

    # Init and configure optimizer
    optimizer = nlp.create_optimizer()
    optimizer.learn_rate = 0.001  # Change to whatever learning rate you prefer
    batch_size = 32  # Choose whatever batch size you prefer

    for itn in range(iterations):
        print("Starting iterations " + str(itn))
        random.shuffle(TRAIN_DATA)
        losses = {}

        # Batch the examples and iterate over them
        for batch in spacy.util.minibatch(TRAIN_DATA, size=batch_size):
            # Create Example instance for each training example in mini batch
            examples = [Example.from_dict(nlp.make_doc(text), annotations) for text, annotations in batch]
            # Update model with mini batch
            losses = nlp.update(examples, drop=0.2, sgd=optimizer)

    print(losses)
    return nlp


if __name__ == "__main__":
    # mock training set
    TRAIN_DATA = [('Dog is an animal', {'entities': [(0, 3, 'ANIMAL')]}),
                  ('Cat is on the table', {'entities': [(0, 3, 'ANIMAL')]}),
                  ('Rats are pets', {'entities': [(0, 4, 'ANIMAL')]})]
    nlp = train_spacy(TRAIN_DATA)
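As a side note, spacy.util.minibatch is easy to inspect in isolation: it simply yields lists of at most size items from whatever iterable you give it, with a smaller final batch if the data doesn't divide evenly. A tiny sketch with placeholder data:

import spacy

data = list(range(10))
for batch in spacy.util.minibatch(data, size=4):
    print(batch)  # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]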

