How to approach this named-entity classification task?
I am asking a related question here, but this question is more general. I have taken a large corpus and annotated some words with their named entities. In my case, they are domain-specific and I call them Entity, Action, and Incident. I want to use these as seeds for extracting more named entities. For example, the following sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the
(object)/Entity was (thrown)/Action but was later (caught)/Action by
(another robot)/Entity.
Given examples like this, is there any way I can train a classifier to recognize new named entities? For instance, given a sentence like this:
The nanobot had a bug and so it crashed into the wall.
it should be tagged something like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible, but I would be interested in knowing of any formal approaches to doing this. Any suggestions?
Comments (2)
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets for NER systems won't help you (English NER systems tend to rely on capitalization quite strongly and will prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect this is going to be quite hard in a machine learning setting because your annotation is really inconsistent:
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
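To make the regular-expression idea concrete, here is a minimal sketch that runs plain `re` patterns over a `word/TAG` string. The tags are written by hand using Penn Treebank labels (in practice a POS tagger would supply them), and the names `NOMINAL`, `VERB`, `AUX`, and `candidates` are invented for this illustration only.

```python
import re

# Hand-tagged input in word/TAG form (Penn Treebank tags).
# Assumption: a POS tagger would normally produce this string.
TAGGED = ("The/DT nanobot/NN had/VBD a/DT bug/NN and/CC so/RB it/PRP "
          "crashed/VBD into/IN the/DT wall/NN ./.")

# Optional adjectives followed by one or more common nouns.
NOMINAL = re.compile(r"(?<!\S)(?:\S+/JJ\s+)*\S+/NNS?(?:\s+\S+/NNS?)*(?!\S)")
# Past-tense or past-participle verbs.
VERB = re.compile(r"(?<!\S)(\S+)/VB[DN](?!\S)")
# Auxiliary-like verbs that should not count as actions (illustrative list).
AUX = {"had", "was", "were", "been", "did"}


def strip_tags(span):
    """Turn 'technical/JJ glitch/NN' into 'technical glitch'."""
    return " ".join(token.rsplit("/", 1)[0] for token in span.split())


def candidates(tagged_sentence):
    """Return (nominal spans, action verbs) found in a word/TAG string."""
    nominals = [strip_tags(m.group(0)) for m in NOMINAL.finditer(tagged_sentence)]
    verbs = [m.group(1) for m in VERB.finditer(tagged_sentence)
             if m.group(1).lower() not in AUX]
    return nominals, verbs


print(candidates(TAGGED))
# (['nanobot', 'bug', 'wall'], ['crashed'])
```

Note that this only finds candidate spans. Telling an Entity ("nanobot") from an Incident ("bug") would still need something extra, such as a word list or a pattern over the surrounding context.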
I can think of 2 approaches.
First is pattern matching over the words in a sentence, with patterns written in something like NLTK chunk parser syntax. Two such patterns can (roughly) catch the two parts of your first sentence. This is a good choice if you don't have many kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. The drawback is that this model is hard to extend or modify.
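As an illustration, two patterns of this kind might look like the sketch below. It is not NLTK itself: it uses only the standard `re` module over a string of `<TAG>` symbols, loosely in the spirit of chunk-grammar tag patterns. The patterns, the hand-tagged `SENTENCE`, and the helper `label_spans` are all hypothetical and written just for this example.

```python
import re

# Hand-tagged tokens (Penn Treebank tags); a POS tagger would normally supply these.
SENTENCE = [
    ("When", "WRB"), ("the", "DT"), ("robot", "NN"), ("had", "VBD"),
    ("a", "DT"), ("technical", "JJ"), ("glitch", "NN"), (",", ","),
    ("the", "DT"), ("object", "NN"), ("was", "VBD"), ("thrown", "VBN"),
    ("but", "CC"), ("was", "VBD"), ("later", "RB"), ("caught", "VBN"),
    ("by", "IN"), ("another", "DT"), ("robot", "NN"), (".", "."),
]

# Hypothetical patterns over the tag sequence; each named group is a label.
PATTERNS = [
    # "(the) ENTITY had (a) INCIDENT"
    re.compile(r"(?:<DT>)?(?P<Entity>(?:<JJ>)*<NN>)<VBD>"
               r"(?:<DT>)?(?P<Incident>(?:<JJ>)*<NN>)"),
    # "(the) ENTITY was (later) ACTION"
    re.compile(r"(?:<DT>)?(?P<Entity>(?:<JJ>)*<NN>)<VBD>"
               r"(?:<RB>)*(?P<Action><VBN>)"),
]


def label_spans(tagged_tokens, patterns=PATTERNS):
    """Match patterns against the tag sequence and return (text, label) pairs."""
    tag_string = ""
    starts = {}  # character offset in tag_string -> token index
    for index, (_, tag) in enumerate(tagged_tokens):
        starts[len(tag_string)] = index
        tag_string += f"<{tag}>"
    starts[len(tag_string)] = len(tagged_tokens)

    results = []
    for pattern in patterns:
        for match in pattern.finditer(tag_string):
            for label in pattern.groupindex:
                begin, end = match.span(label)
                words = [w for w, _ in tagged_tokens[starts[begin]:starts[end]]]
                results.append((" ".join(words), label))
    return results


for text, label in label_spans(SENTENCE):
    print(f"({text})/{label}")
```

On the example sentence this labels "robot", "technical glitch", "object", and "thrown", but misses "caught" and "another robot", which shows both why the match is only rough and why every new sentence shape needs another pattern.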
Another approach is to mine the dependencies between words in a sentence, for example with the Stanford Dependency Parser. Among other things, it lets you extract the subject, predicate, and object, which seems very close to what you want: in your first sentence, "robot" is the subject, "had" is the predicate, and "glitch" is the object.
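A rough sketch of how the parser output could be used, assuming the parse is already available as (relation, head, dependent) triples. The triples below are written by hand in Stanford-style notation for the clause "the robot had a technical glitch"; they are not actual parser output, and `subject_predicate_object` is a made-up helper name.

```python
# Hand-written dependency triples (relation, head, dependent).
# Assumption: a dependency parser would produce something equivalent.
DEPENDENCIES = [
    ("det", "robot", "the"),
    ("nsubj", "had", "robot"),
    ("root", "ROOT", "had"),
    ("det", "glitch", "a"),
    ("amod", "glitch", "technical"),
    ("dobj", "had", "glitch"),
]


def subject_predicate_object(dependencies):
    """Pair each verb's nominal subject with its direct object."""
    subjects, objects = {}, {}
    for relation, head, dependent in dependencies:
        if relation == "nsubj":
            subjects[head] = dependent
        elif relation == "dobj":
            objects[head] = dependent
    # Keep only verbs that have both a subject and an object.
    return [(subjects[verb], verb, objects[verb])
            for verb in subjects if verb in objects]


print(subject_predicate_object(DEPENDENCIES))
# [('robot', 'had', 'glitch')]
```

Keying on the word itself is a simplification that breaks when a word repeats in the sentence; real parser output identifies tokens by position, which would be the safer key.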