How do I train a named entity recognizer in OpenNLP?

Posted on 2024-11-27 22:57:32

OK, I have the following code to train a named entity recognizer (NER) with OpenNLP:

// Read the training file one line (one sentence) at a time.
FileReader fileReader = new FileReader("train.txt");
ObjectStream<String> fileStream = new PlainTextByLineStream(fileReader);
// Parse the <START:type> ... <END> annotations into NameSample objects.
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(fileStream);
TokenNameFinderModel model = NameFinderME.train("pt-br", "train", sampleStream, Collections.<String, Object>emptyMap());
nfm = new NameFinderME(model);

I don't know if I'm doing something wrong or if something is missing, but the classification is not working. I suspect that train.txt is wrong.

The problem is that all tokens are classified as a single type.

My train.txt data looks like the following example, but with much more variation and many more entries. Also, I'm classifying the text one word at a time rather than passing all the tokens at once.

<START:distance> 8000m <END>
<START:temperature> 100ºC <END>
<START:weight> 50kg <END>
<START:name> Renato <END>

Can somebody show me what I'm doing wrong?

Comments (1)

怀中猫帐中妖 2024-12-04 22:57:32

Your training data is not OK.

You should put every entity in context, inside a sentence:

At an altitude of <START:distance> 8000m <END> the temperature of boiling water is less than <START:temperature> 100ºC <END> .
The climber <START:name> Renato <END> is carrying <START:weight> 50kg <END> of equipment.
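
A quick way to spot entity-only lines, like those in the train.txt from the question, is to count the tokens that fall outside the tags. The following is a minimal sketch in plain Java, not part of OpenNLP: the class and method names are hypothetical, and it assumes whitespace-separated tokens as in the examples above.

```java
// Hypothetical helper (not an OpenNLP class): checks whether a training
// line gives the entities any surrounding context.
class TrainingLineCheck {

    /** Counts the tokens that sit outside every <START:type> ... <END> pair. */
    static int contextTokenCount(String line) {
        int outside = 0;
        boolean insideEntity = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line
            }
            if (token.startsWith("<START:") && token.endsWith(">")) {
                insideEntity = true;      // opening tag of an entity
            } else if (token.equals("<END>")) {
                insideEntity = false;     // closing tag of an entity
            } else if (!insideEntity) {
                outside++;                // ordinary context token
            }
        }
        return outside;
    }

    /** True if at least one token surrounds the annotated entities. */
    static boolean hasContext(String line) {
        return contextTokenCount(line) > 0;
    }

    public static void main(String[] args) {
        String entityOnly = "<START:distance> 8000m <END>";
        String inSentence = "The climber <START:name> Renato <END> is carrying "
                + "<START:weight> 50kg <END> of equipment.";
        System.out.println(entityOnly + " -> context: " + hasContext(entityOnly));
        System.out.println(inSentence + " -> context: " + hasContext(inSentence));
    }
}
```

Lines for which hasContext returns false are the ones worth rewriting as full sentences before training.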

You will get better results if your training data comes from real-world sentences and has the same style as the sentences you are classifying. For example, if you will be processing news, you should train on a newspaper corpus.

Also, you will need thousands of sentences to build your model! You could start with about a hundred to bootstrap, then use the resulting poor model to help annotate more of your corpus and train the model again.
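
For the bootstrapping step, it helps to write the weak model's predictions back out in the training format so they can be corrected by hand and added to the corpus. Below is a rough sketch under stated assumptions: TypedSpan is a hypothetical stand-in for the (start, end, type) information that OpenNLP's Span objects carry, used here so the example runs without the library, and the spans are assumed not to overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): renders tokens plus predicted
// spans as a line in the <START:type> ... <END> training format.
class SpanAnnotator {

    /** Stand-in for a predicted entity covering tokens [start, end). */
    static final class TypedSpan {
        final int start;
        final int end;
        final String type;

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Assumes the spans do not overlap. */
    static String annotate(String[] tokens, List<TypedSpan> spans) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            for (TypedSpan s : spans) {
                if (s.start == i) {
                    out.append("<START:").append(s.type).append("> ");
                }
            }
            out.append(tokens[i]);
            for (TypedSpan s : spans) {
                if (s.end == i + 1) {
                    out.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "climber", "Renato", "is", "carrying", "50kg", "of", "equipment", "."};
        List<TypedSpan> predicted = new ArrayList<>();
        predicted.add(new TypedSpan(2, 3, "name"));
        predicted.add(new TypedSpan(5, 6, "weight"));
        // Print the line for a human to review and correct before retraining.
        System.out.println(annotate(tokens, predicted));
    }
}
```

The corrected lines can then be appended to train.txt before the next training round.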

And of course you should classify all the tokens of a sentence together; otherwise there is no context for deciding the type of an entity.
