Incrementally trainable entity recognition classifier
I'm doing some semantic-web/nlp research, and I have a set of sparse records, containing a mix of numeric and non-numeric data, representing entities labeled with various features extracted from simple English sentences.
e.g.
uid|features
87w39423|speaker=432, session=43242, sentence=34, obj_called=bob,favorite_color_is=blue
4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree,isa=plant,located_on=wilson_street
23432424|speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice
09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana
928374923|speaker=876, session=43242, sentence=57, obj_called=it,was=delicious
294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana
sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it,was=hungry
...
A single entity may appear more than once (but with a different UID each time), and may have overlapping features with its other occurrences. A second data set represents which of the above UIDs are definitely the same.
e.g.
uid|sameas
87w39423|234k2j,234l24jlsd,dsdf9887s
4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
23432424|io43po5,2l3jk42,sdf90s8df
09834502|294234234,sd09f8098
...
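In case it helps, here is a minimal sketch of reading these two pipe-delimited formats into Python dicts (the function names and file handling are just illustrative assumptions, not part of my actual pipeline):

def parse_features(path):
    """Parse 'uid|k1=v1, k2=v2, ...' lines into {uid: {feature: value}}."""
    records = {}
    with open(path) as f:
        next(f)                              # skip the 'uid|features' header
        for line in f:
            uid, _, feats = line.strip().partition('|')
            pairs = (p.split('=', 1) for p in feats.split(',') if '=' in p)
            records[uid] = dict((k.strip(), v.strip()) for k, v in pairs)
    return records

def parse_sameas(path):
    """Parse 'uid|uid1,uid2,...' lines into {uid: set of same-as uids}."""
    sameas = {}
    with open(path) as f:
        next(f)                              # skip the 'uid|sameas' header
        for line in f:
            uid, _, rest = line.strip().partition('|')
            sameas[uid] = set(u.strip() for u in rest.split(',') if u.strip())
    return sameas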
What algorithm(s) would I use to incrementally train a classifier that could take a set of features, and instantly recommend the N most similar UIDs along with the probability that those UIDs actually represent the same entity? Optionally, I'd also like to get a recommendation of missing features to populate and then re-classify to get a more certain match.
I researched traditional approximate nearest neighbor algorithms, such as FLANN and ANN, but I don't think these would be appropriate, since they're not trainable (in a supervised-learning sense), nor are they typically designed for sparse non-numeric input.
As a very naive first attempt, I was thinking about using a naive Bayesian classifier and converting each sameas relation into a set of training samples. So, for each entity A and its set of sameas entities B, I would iterate over each pair and train the classifier like:
classifier = Classifier()
for entity, sameas_entities in sameas_dataset:
    entity_features = get_features(entity)
    for other_entity in sameas_entities:
        other_entity_features = get_features(other_entity)
        # Train in both directions, prefixing each feature with which side of the pair it came from.
        classifier.train(['left_' + f for f in entity_features] +
                         ['right_' + f for f in other_entity_features], cls=entity)
        classifier.train(['left_' + f for f in other_entity_features] +
                         ['right_' + f for f in entity_features], cls=other_entity)
And then use it like:
>>> print classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32, obj_called='salty',isa='cat',eats='mice'), n=7)
[(1.0, '23432424'), (0.999, 'io43po5'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print classifier.findSameAs(dict(isa='cat',eats='mice'), n=7)
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print classifier.findMissingFeatures(dict(isa='cat',eats='mice'), n=4)
['obj_called','has_fur','has_claws','lives_at_zoo']
How viable is this approach? The initial batch training would be horribly slow, at least O(N^2), but incremental training support would allow updates to happen more quickly.
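To make the incremental part concrete, here is a rough hand-rolled sketch of what such a classifier could look like (the class and method names are my own placeholders, and the scores it returns are unnormalised log-probabilities rather than the 0..1 probabilities shown in the hypothetical output above):

import math
from collections import defaultdict

class NaiveBayesSameAs(object):
    """Incrementally trainable multinomial naive Bayes sketch (illustrative only)."""

    def __init__(self):
        self.class_counts = defaultdict(int)                          # UID -> samples seen
        self.feature_counts = defaultdict(lambda: defaultdict(int))   # UID -> feature -> count
        self.vocab = set()                                            # all feature strings seen
        self.total = 0

    def train(self, cls, features):
        # Incremental update: just bump counts, no retraining pass over old data.
        self.class_counts[cls] += 1
        self.total += 1
        for f in features:
            self.feature_counts[cls][f] += 1
            self.vocab.add(f)

    def find_same_as(self, features, n=7):
        # Score every known UID with naive Bayes plus add-one smoothing.
        scores = {}
        for cls, count in self.class_counts.items():
            score = math.log(count / float(self.total))
            denom = float(sum(self.feature_counts[cls].values()) + len(self.vocab))
            for f in features:
                score += math.log((self.feature_counts[cls].get(f, 0) + 1) / denom)
            scores[cls] = score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

Each train() call only touches the counts for one sample, so adding a newly observed sameas pair is cheap; the O(N^2)-ish cost is only in the initial pass over all existing pairs.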
What are better approaches?
Comments (2)
I think this is more of a clustering than a classification problem. Your entities are data points and the sameas data is a mapping of entities to clusters. In this case, clusters are the distinct 'things' your entities refer to.
You might want to take a look at semi-supervised clustering. A brief Google search turned up the paper Active Semi-Supervision for Pairwise Constrained Clustering, which gives pseudocode for an algorithm that is incremental/active and uses supervision in the sense that it takes training data indicating which entities are or are not in the same cluster. You could derive this easily from your sameas data, assuming that, for example, uids 87w39423 and 4535k3l535 are definitely distinct things.
However, to get this to work you need to come up with a distance metric based on the features in the data. You have a lot of options here; for example, you could use a simple Hamming distance on the features, but the choice of metric function is a bit arbitrary. I'm not aware of any good ways of choosing the metric, but perhaps you have already looked into this when you were considering nearest-neighbour algorithms.
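For illustration only (the helper names here are mine, not from that paper), deriving must-link constraints from the sameas data and a Hamming-style distance over the sparse feature dicts could be sketched as:

def must_link_pairs(sameas):
    """sameas: dict mapping uid -> collection of uids known to refer to the same thing."""
    pairs = set()
    for uid, others in sameas.items():
        for other in others:
            pairs.add(tuple(sorted((uid, other))))
    return pairs

def hamming_distance(a, b):
    """Count the features on which two sparse feature dicts disagree;
    a feature present in only one of the two also counts as a disagreement."""
    keys = set(a) | set(b)
    return sum(1 for k in keys if a.get(k) != b.get(k))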
You can come up with confidence scores using the distance from the cluster centres. If you want an actual probability of membership, then you would want to use a probabilistic clustering model, like a Gaussian mixture model. There's quite a lot of software for Gaussian mixture modelling, though I don't know of any that is semi-supervised or incremental.
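For example, with scikit-learn's GaussianMixture the membership probabilities come from predict_proba, though this sketch assumes the sparse features have already been converted to numeric vectors somehow, which is itself a non-trivial step for this kind of data:

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder: stands in for feature vectors derived from the entity records.
X = np.random.rand(200, 10)

gmm = GaussianMixture(n_components=5, random_state=0).fit(X)

query = X[:1]
membership = gmm.predict_proba(query)   # P(cluster | query) for each of the 5 components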
There may be other suitable approaches if the question you wanted to answer was something like "given an entity, which other entities are likely to refer to the same thing?", but I don't think that is what you are after.
You may want to take a look at this method:
More thoughts:
What do you mean by 'entity'? Is the entity the thing referred to by 'obj_called'? Do you use the content of 'obj_called' to match different entities, e.g. 'John' is similar to 'John Doe'? Do you use proximity between sentences to indicate similar entities? What is the greater goal (task) of the mapping?