Can Lingpipe be used to extract generic entities other than People, Org and Loc?

Posted on 2024-12-06 07:12:40

I have read through Lingpipe for NLP and found that it can identify mentions of the names of people, locations and organizations. My question is: if I have a training set of documents that mention, say, software projects in the text, can I use this training set to train a named entity recognizer? Once training is complete, I should be able to feed a test set of text documents to the trained model and identify the mentions of software projects in them.

Is this kind of generic NER possible with Lingpipe? If so, which features should I feed to the model?

Thanks
Abhishek S

Comments (2)

执笏见 2024-12-13 07:12:40

Provided that you have enough training data with tagged software projects, that would be possible.

If using Lingpipe, I would try a character n-gram model as the first option for your task. Such models are simple and usually do the job. If the results are not good enough, some of the standard NER features are:

  • tokens
  • part of speech (POS)
  • capitalization
  • punctuation
  • character signatures, for example: (LUCENE -> AAAAAA -> A), (Lucene -> Aaaaaa -> Aa), (Lucene-core -> Aaaaaa-aaaa -> Aa-a)
  • a gazetteer (list of software projects), which may also be useful if you can obtain one from Wikipedia, SourceForge or any internal resource.
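The character-signature and gazetteer features in the list can be sketched in a few lines. This is a minimal, library-agnostic illustration in Python, not Lingpipe's API: the function names `shape`, `compressed_shape` and `token_features`, the digit class `9`, and the tiny `GAZETTEER` set are all assumptions made up for the example.

```python
import re

# Hypothetical gazetteer; in practice this would be loaded from a list of
# project names gathered from Wikipedia, SourceForge or an internal source.
GAZETTEER = {"lucene", "hadoop", "tomcat"}


def shape(token: str) -> str:
    """Map every character to a class: A (upper), a (lower), 9 (digit).

    Any other character (hyphen, dot, ...) is kept as-is.
    """
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def compressed_shape(token: str) -> str:
    """Collapse each run of identical shape characters into one character."""
    return re.sub(r"(.)\1+", r"\1", shape(token))


def token_features(token: str) -> dict:
    """Bundle a few of the per-token features from the list above."""
    return {
        "token": token,
        "shape": shape(token),
        "short_shape": compressed_shape(token),
        "capitalized": token[:1].isupper(),
        "in_gazetteer": token.lower() in GAZETTEER,
    }


print(token_features("Lucene-core"))
```

The compressed signature generalizes better than the full one, because `Lucene`, `Hadoop` and `Tomcat` all collapse to `Aa` regardless of length.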

Finally, for each token you could add contextual features: the tokens before the current one (t-1, t-2, ...), the tokens after it (t+1, t+2, ...), and their bigram combinations (t-2^t-1), (t+1^t+2).
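A rough sketch of those contextual features, again in plain Python rather than Lingpipe code. The function name `context_features`, the feature keys, and the `<BOS>`/`<EOS>` padding symbols are assumptions chosen for illustration.

```python
def context_features(tokens, i, window=2):
    """Collect neighbouring tokens and neighbour bigrams for tokens[i]."""

    def tok(j):
        # Pad positions that fall outside the sentence.
        if j < 0:
            return "<BOS>"
        if j >= len(tokens):
            return "<EOS>"
        return tokens[j]

    feats = {"t": tok(i)}
    for k in range(1, window + 1):
        feats[f"t-{k}"] = tok(i - k)  # tokens before the current one
        feats[f"t+{k}"] = tok(i + k)  # tokens after the current one
    # Bigram combinations of the two preceding and two following tokens.
    feats["t-2^t-1"] = tok(i - 2) + "^" + tok(i - 1)
    feats["t+1^t+2"] = tok(i + 1) + "^" + tok(i + 2)
    return feats


sentence = ["We", "use", "Lucene", "for", "search"]
print(context_features(sentence, 2))
```

Words such as "use" or "for" around a capitalized token are the kind of context that can help a model separate software names from other capitalized words.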

太阳公公是暖光 2024-12-13 07:12:40

Of course you can. Just get training data covering all the categories you need and follow the tutorial at http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html.
No feature tuning is required, since Lingpipe uses only hardcoded features (shapes, word sequences and n-grams).
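To make "training data with all the categories you need" concrete, here is one common way to label tokens for a custom category, the BIO scheme. This is a generic sketch and an assumption on my part: the tutorial linked above defines the input formats Lingpipe actually expects, and the `SOFTWARE` label and `bio_tags` helper are hypothetical names.

```python
def bio_tags(tokens, spans, label="SOFTWARE"):
    """Turn entity spans into per-token BIO tags.

    spans holds (start, end) token indices, with end exclusive.
    """
    tags = ["O"] * len(tokens)  # O = outside any entity
    for start, end in spans:
        tags[start] = f"B-{label}"  # B = first token of an entity
        for j in range(start + 1, end):
            tags[j] = f"I-{label}"  # I = continuation of the entity
    return tags


tokens = ["We", "index", "with", "Apache", "Lucene", "daily"]
for token, tag in zip(tokens, bio_tags(tokens, [(3, 5)])):
    print(token, tag)
```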
