Incrementally trainable entity recognition classifier
I'm doing some semantic-web/nlp research, and I have a set of sparse records, containing a mix of numeric and non-numeric data, representing entities labeled with various features extracted from simple English sentences.
e.g.
uid|features
87w39423|speaker=432, session=43242, sentence=34, obj_called=bob,favorite_color_is=blue
4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree,isa=plant,located_on=wilson_street
23432424|speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice
09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana
928374923|speaker=876, session=43242, sentence=57, obj_called=it,was=delicious
294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana
sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it,was=hungry
...
A single entity may appear more than once (but with a different UID each time), and may have overlapping features with its other occurrences. A second data set represents which of the above UIDs are definitely the same.
e.g.
uid|sameas
87w39423|234k2j,234l24jlsd,dsdf9887s
4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
23432424|io43po5,2l3jk42,sdf90s8df
09834502|294234234,sd09f8098
...
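In case it helps, here is a minimal sketch of reading these two pipe-delimited formats into Python dicts (the function names and file handling are just illustrative assumptions, not part of my actual pipeline):

def parse_features(path):
    """Parse 'uid|k1=v1, k2=v2, ...' lines into {uid: {feature: value}}."""
    records = {}
    with open(path) as f:
        next(f)                              # skip the 'uid|features' header
        for line in f:
            uid, _, feats = line.strip().partition('|')
            pairs = (p.split('=', 1) for p in feats.split(',') if '=' in p)
            records[uid] = dict((k.strip(), v.strip()) for k, v in pairs)
    return records

def parse_sameas(path):
    """Parse 'uid|uid1,uid2,...' lines into {uid: set of same-as uids}."""
    sameas = {}
    with open(path) as f:
        next(f)                              # skip the 'uid|sameas' header
        for line in f:
            uid, _, rest = line.strip().partition('|')
            sameas[uid] = set(u.strip() for u in rest.split(',') if u.strip())
    return sameas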
What algorithm(s) would I use to incrementally train a classifier that could take a set of features, and instantly recommend the N most similar UIDs along with the probability that those UIDs actually represent the same entity? Optionally, I'd also like to get a recommendation of missing features to populate and then re-classify to get a more certain match.
I researched traditional approximate nearest neighbor algorithms, such as FLANN and ANN, but I don't think these would be appropriate, since they're not trainable (in a supervised-learning sense), nor are they typically designed for sparse non-numeric input.
As a very naive first attempt, I was thinking about using a naive Bayesian classifier and converting each sameas relation into a set of training samples. So, for each entity A and its set of sameas entities B, I would iterate over each pair and train the classifier like:
classifier = Classifier()
for entity, sameas_entities in sameas_dataset:
    entity_features = get_features(entity)
    for other_entity in sameas_entities:
        other_entity_features = get_features(other_entity)
        # Train in both directions, prefixing each feature with which side of the pair it came from.
        classifier.train(['left_' + f for f in entity_features] +
                         ['right_' + f for f in other_entity_features], cls=entity)
        classifier.train(['left_' + f for f in other_entity_features] +
                         ['right_' + f for f in entity_features], cls=other_entity)
And then use it like:
>>> print classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32, obj_called='salty',isa='cat',eats='mice'), n=7)
[(1.0, '23432424'), (0.999, 'io43po5'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print classifier.findSameAs(dict(isa='cat',eats='mice'), n=7)
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print classifier.findMissingFeatures(dict(isa='cat',eats='mice'), n=4)
['obj_called','has_fur','has_claws','lives_at_zoo']
How viable is this approach? The initial batch training would be horribly slow, at least O(N^2), but incremental training support would allow updates to happen more quickly.
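To make the incremental part concrete, here is a rough hand-rolled sketch of what such a classifier could look like (the class and method names are my own placeholders, and the scores it returns are unnormalised log-probabilities rather than the 0..1 probabilities shown in the hypothetical output above):

import math
from collections import defaultdict

class NaiveBayesSameAs(object):
    """Incrementally trainable multinomial naive Bayes sketch (illustrative only)."""

    def __init__(self):
        self.class_counts = defaultdict(int)                          # UID -> samples seen
        self.feature_counts = defaultdict(lambda: defaultdict(int))   # UID -> feature -> count
        self.vocab = set()                                            # all feature strings seen
        self.total = 0

    def train(self, cls, features):
        # Incremental update: just bump counts, no retraining pass over old data.
        self.class_counts[cls] += 1
        self.total += 1
        for f in features:
            self.feature_counts[cls][f] += 1
            self.vocab.add(f)

    def find_same_as(self, features, n=7):
        # Score every known UID with naive Bayes plus add-one smoothing.
        scores = {}
        for cls, count in self.class_counts.items():
            score = math.log(count / float(self.total))
            denom = float(sum(self.feature_counts[cls].values()) + len(self.vocab))
            for f in features:
                score += math.log((self.feature_counts[cls].get(f, 0) + 1) / denom)
            scores[cls] = score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

Each train() call only touches the counts for one sample, so adding a newly observed sameas pair is cheap; the O(N^2)-ish cost is only in the initial pass over all existing pairs.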
What are better approaches?
Comments (2)
I think this is more of a clustering than a classification problem. Your entities are data points and the sameas data is a mapping of entities to clusters. In this case, clusters are the distinct 'things' your entities refer to.
You might want to take a look at semi-supervised clustering. A brief Google search turned up the paper Active Semi-Supervision for Pairwise Constrained Clustering, which gives pseudocode for an algorithm that is incremental/active and uses supervision in the sense that it takes training data indicating which entities are or are not in the same cluster. You could derive this easily from your sameas data, assuming that, for example, uids 87w39423 and 4535k3l535 are definitely distinct things.
However, to get this to work you need to come up with a distance metric based on the features in the data. You have a lot of options here; for example, you could use a simple Hamming distance on the features, but the choice of metric function is a bit arbitrary. I'm not aware of any good ways of choosing the metric, but perhaps you have already looked into this when you were considering nearest-neighbour algorithms.
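For illustration only (the helper names here are mine, not from that paper), deriving must-link constraints from the sameas data and a Hamming-style distance over the sparse feature dicts could be sketched as:

def must_link_pairs(sameas):
    """sameas: dict mapping uid -> collection of uids known to refer to the same thing."""
    pairs = set()
    for uid, others in sameas.items():
        for other in others:
            pairs.add(tuple(sorted((uid, other))))
    return pairs

def hamming_distance(a, b):
    """Count the features on which two sparse feature dicts disagree;
    a feature present in only one of the two also counts as a disagreement."""
    keys = set(a) | set(b)
    return sum(1 for k in keys if a.get(k) != b.get(k))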
You can come up with confidence scores using the distance from the cluster centres. If you want an actual probability of membership, then you would want to use a probabilistic clustering model, like a Gaussian mixture model. There's quite a lot of software for Gaussian mixture modelling, though I don't know of any that is semi-supervised or incremental.
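For example, with scikit-learn's GaussianMixture the membership probabilities come from predict_proba, though this sketch assumes the sparse features have already been converted to numeric vectors somehow, which is itself a non-trivial step for this kind of data:

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder: stands in for feature vectors derived from the entity records.
X = np.random.rand(200, 10)

gmm = GaussianMixture(n_components=5, random_state=0).fit(X)

query = X[:1]
membership = gmm.predict_proba(query)   # P(cluster | query) for each of the 5 components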
There may be other suitable approaches if the question you wanted to answer was something like "given an entity, which other entities are likely to refer to the same thing?", but I don't think that is what you are after.
You may want to take a look at this method:
More thoughts:
What do you mean by 'entity'? Is the entity the thing referred to by 'obj_called'? Do you use the content of 'obj_called' to match different entities, e.g. 'John' is similar to 'John Doe'? Do you use proximity between sentences to indicate similar entities? What is the greater goal (task) of the mapping?