Scalable classifier for finding missing attributes

Posted 2024-09-08 16:13:52

I have a large sparse matrix representing attributes for millions of entities. For example, one record, representing an entity, might have attributes "has(fur)", "has(tail)", "makesSound(meow)", and "is(cat)".
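
For concreteness, here is a minimal sketch of that layout as a sparse 0/1 entity-by-attribute matrix (toy records and helper names, not the real data; SciPy assumed):

```python
import numpy as np
from scipy.sparse import csr_matrix

attributes = ["has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"]
attr_index = {a: i for i, a in enumerate(attributes)}

# Each row is one entity; a 1 means the entity has that attribute.
records = [
    ["has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"],
    ["has(fur)", "has(tail)", "makesSound(meow)"],  # likely missing is(cat)
]
rows, cols = [], []
for r, attrs in enumerate(records):
    for a in attrs:
        rows.append(r)
        cols.append(attr_index[a])
X = csr_matrix((np.ones(len(rows)), (rows, cols)),
               shape=(len(records), len(attributes)))
print(X.toarray())
```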

However, this data is incomplete. For example, another entity might have all the attributes of a typical "is(cat)" entity, but it might be missing the "is(cat)" attribute. In this case, I want to determine the probability that this entity should have the "is(cat)" attribute.

So the problem I'm trying to solve is determining which missing attributes each entity should contain. Given an arbitrary record, I want to find the top N most likely attributes that are missing but should be included. I'm not sure what the formal name is for this type of problem, so I'm unsure what to search for when researching current solutions. Is there a scalable solution for this type of problem?

My first idea is to simply calculate the conditional probability for each missing attribute (e.g. P(is(cat) | has(fur) and has(tail) and ...)), but that seems like a very slow approach. Plus, as I understand the traditional calculation of conditional probability, I imagine I'd run into problems where my entity contains a few unusual attributes that aren't common among other is(cat) entities, causing the conditional probability to be zero.
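
The usual workaround for those zeros (a sketch, not from the original post) is to assume conditional independence between attributes, naive-Bayes style, and apply Laplace smoothing, so an unusual attribute shrinks the score instead of zeroing it out:

```python
import numpy as np

def smoothed_log_score(X, target_col, evidence_cols, alpha=1.0):
    """Naive-Bayes-style score for 'entity should have target_col',
    with Laplace smoothing so no single rare attribute zeroes it out.
    X is the hypothetical sparse 0/1 matrix from the earlier sketch."""
    Xd = X.toarray()  # dense only for clarity; keep it sparse at scale
    has_target = Xd[:, target_col] == 1
    n_pos = has_target.sum()
    n = Xd.shape[0]
    # log P(target), smoothed
    score = np.log((n_pos + alpha) / (n + 2 * alpha))
    for c in evidence_cols:
        n_both = np.logical_and(has_target, Xd[:, c] == 1).sum()
        # log P(evidence_c | target); never exactly zero thanks to alpha
        score += np.log((n_both + alpha) / (n_pos + 2 * alpha))
    return score
```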

My second idea is to train a Maximum Entropy classifier for each attribute, and then evaluate it based on the entity's current attributes. I think the probability calculation would be much more flexible, but this would still have scalability problems, since I'd have to train separate classifiers for potentially millions of attributes. In addition, if I wanted to find the top N most likely attributes to include, I'd still have to evaluate all the classifiers, which would likely take forever.
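
For reference, a per-attribute classifier along these lines might look like the following sketch, using logistic regression (a maximum-entropy model) from scikit-learn; the helper name is hypothetical, and training millions of these is exactly the scalability problem described:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_model(X, target_col):
    """Train one classifier for a single attribute: features are all
    other columns, label is whether the entity has the target attribute.
    Assumes both label values actually occur in the data."""
    mask = np.ones(X.shape[1], dtype=bool)
    mask[target_col] = False
    features = X[:, mask]  # sparse input is fine for scikit-learn
    labels = X[:, target_col].toarray().ravel()
    return LogisticRegression(max_iter=1000).fit(features, labels)
```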

Are there better solutions?

2 Answers

少女七分熟 2024-09-15 16:13:55

If you have a large data set and you're worried about scalability, then I would look into Apache Mahout. Mahout is a machine learning and data mining library that might help with your project; in particular, it has some of the best-known algorithms already built in:

  • Collaborative Filtering
  • User- and item-based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High-performance Java collections (previously Colt collections)
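
Mahout itself is a Java library; purely to illustrate the item-based collaborative-filtering idea from the list above, here is a rough Python sketch that scores an entity's missing attributes by cosine similarity between attribute columns (toy approach, assuming the sparse 0/1 matrix from the question):

```python
import numpy as np
from sklearn.preprocessing import normalize

def top_n_missing(X, entity_row, n=5):
    """Score missing attributes by summed cosine similarity to the
    attributes the entity already has (item-based CF)."""
    Xn = normalize(X, norm="l2", axis=0)  # unit-length attribute columns
    sim = (Xn.T @ Xn).toarray()           # dense only for clarity
    np.fill_diagonal(sim, 0.0)
    present = X[entity_row].toarray().ravel() == 1
    scores = sim[present].sum(axis=0)     # similarity to attributes we have
    scores[present] = -np.inf             # only rank attributes we lack
    return np.argsort(scores)[::-1][:n]
```

At millions of attributes the dense similarity matrix is infeasible; in practice you would keep it sparse or use approximate nearest-neighbor search over attribute columns.
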
蓝眼睛不忧郁 2024-09-15 16:13:54

This sounds like a typical recommendation problem. Think of each attribute as a 'movie rating' and each row as a 'person': for each person, you want to find the movies they would probably like but haven't rated yet.

You should look at some of the more successful approaches to the Netflix Challenge. The dataset is pretty large, so efficiency is a high priority. A good place to start might be the paper 'Matrix Factorization Techniques for Recommender Systems'.
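
As a rough illustration of the matrix-factorization idea from that paper (its Netflix-scale models use regularized SGD or ALS; a plain truncated SVD is the simplest stand-in), a sketch against the question's hypothetical 0/1 matrix might look like:

```python
import numpy as np
from scipy.sparse.linalg import svds

def mf_scores(X, entity_row, k=50):
    """Rank an entity's missing attributes by the low-rank
    reconstruction of the 0/1 matrix, the simplest MF baseline."""
    k = min(k, min(X.shape) - 1)       # svds requires k < min(X.shape)
    U, s, Vt = svds(X.astype(np.float64), k=k)
    scores = (U[entity_row] * s) @ Vt  # reconstructed row for this entity
    present = X[entity_row].toarray().ravel() == 1
    scores[present] = -np.inf          # only rank attributes not yet present
    return np.argsort(scores)[::-1]
```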
