Can Lingpipe be used to extract generic entities other than People, Org and Loc?
I have read through Lingpipe for NLP and found that it can identify mentions of the names of people, locations and organizations. My question is: if I have a training set of documents that contain mentions of, say, software projects, can I use that training set to train a named entity recognizer? Once training is complete, I should be able to feed a test set of text documents to the trained model and identify the mentions of software projects in them.
Is this kind of generic NER possible using Lingpipe? If so, what features should I be using?
Thanks
Abhishek S
Comments (2)
Provided that you have enough training data in which the software projects are tagged, that would be possible.
If using Lingpipe, I would try a character n-gram model as the first option for your task. These models are simple and usually do the job. If the results are not good enough, you can move on to the standard NER features.
For each token you could add contextual features: the tokens before the current one (t-1, t-2, ...), the tokens after the current one (t+1, t+2, ...), as well as their bigram combinations (t-2^t-1) and (t+1^t+2).
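To make the contextual features above concrete, here is a minimal sketch in plain Python. It does not use Lingpipe or any other library; the function name `context_features`, the feature keys, and the `<BOS>`/`<EOS>` padding markers are all my own assumptions for illustration, not anything Lingpipe defines.

```python
def context_features(tokens, i):
    """Build contextual features for tokens[i] (hypothetical helper).

    Covers the two previous tokens, the two next tokens, and the
    bigram combinations (t-2^t-1) and (t+1^t+2).
    """
    def tok(j):
        # Pad positions that fall outside the sentence.
        if j < 0:
            return "<BOS>"
        if j >= len(tokens):
            return "<EOS>"
        return tokens[j]

    return {
        "t": tok(i),
        "t-1": tok(i - 1),
        "t-2": tok(i - 2),
        "t+1": tok(i + 1),
        "t+2": tok(i + 2),
        "t-2^t-1": tok(i - 2) + "^" + tok(i - 1),
        "t+1^t+2": tok(i + 1) + "^" + tok(i + 2),
    }


# Example sentence (made up for illustration).
tokens = ["We", "migrated", "to", "Apache", "Lucene", "last", "year"]
features = context_features(tokens, 3)  # features for "Apache"
for name, value in features.items():
    print(name, "=", value)
```

A dictionary like this would be computed for every token and handed to whatever sequence classifier you train; the window size of two on each side is just the one mentioned above and can be widened.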
Of course you can. Just get training data that covers all the categories you need and follow the tutorial at http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html.
No feature tuning is required, since Lingpipe uses only hard-coded features (shapes, word sequences and n-grams).
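As a rough illustration of what "training data with all categories you need" can look like, the sketch below turns span annotations into per-token BIO tags for a custom category. This is a generic tagging scheme, not Lingpipe's own input format (the tutorial linked above describes the formats it reads); the `SOFTWARE` label, the function name `bio_tags`, and the example sentence are assumptions made up for this example.

```python
def bio_tags(tokens, spans):
    """Convert (start, end, label) token spans into BIO tags.

    `end` is exclusive. Tokens outside every span are tagged "O".
    Hypothetical helper for illustration only.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the mention
        for j in range(start + 1, end):
            tags[j] = "I-" + label          # remaining tokens of the mention
    return tags


# One annotated sentence with a single software-project mention.
tokens = ["We", "migrated", "to", "Apache", "Lucene", "last", "year"]
spans = [(3, 5, "SOFTWARE")]  # "Apache Lucene"

for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```

The same idea extends to several categories at once: each annotated span simply carries its own label.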