Is there a supervised learning algorithm that takes tags as input and produces a probability as output?
Let's say I want to determine the probability that I will upvote a question on SO, based only on which tags are present or absent.
Let's also imagine that I have plenty of data about past questions that I did or did not upvote.
Is there a machine learning algorithm that could take this historical data, train on it, and then be able to predict my upvote probability for future questions? Note that it must be the probability, not just some arbitrary score.
Let's assume that there will be up to 7 tags associated with any given question, these being drawn from a superset of tens of thousands.
My hope is that it is able to make quite sophisticated connections between tags, rather than each tag simply contributing to the end result in a "linear" way (much as words do in a Bayesian spam filter).
So for example, it might be that the word "java" increases my upvote probability, except when it is present with "database", however "database" might increase my upvote probability when present with "ruby".
Oh, and it should be computationally reasonable (training within an hour or two on millions of questions).
What approaches should I research here?
Given that there probably aren't many tags per message, you could just create "n-gram" tags and apply naive Bayes. Regression trees would also produce an empirical probability at the leaf nodes, using +1 for upvote and 0 for no upvote. See http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf for some readable lecture notes and http://sites.google.com/site/rtranking/ for an open source implementation.
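The n-gram-tags idea can be sketched in plain Python: expand each question's tags into unigram and pair features, then apply naive Bayes with Laplace smoothing. This is a minimal illustration, not a tuned implementation; the function names (`tag_ngrams`, `train_nb`, `predict_upvote_prob`) are made up for this sketch.

```python
import math
from itertools import combinations
from collections import defaultdict

def tag_ngrams(tags):
    """Expand a question's tags into unigram and pair ("bigram") features,
    so interactions like (java, database) get their own statistics."""
    feats = list(tags)
    feats += ["+".join(sorted(p)) for p in combinations(sorted(tags), 2)]
    return feats

def train_nb(data):
    """data: list of (tags, label) pairs, label 1 = upvote, 0 = no upvote.
    Returns per-class feature counts and class totals."""
    counts = {0: defaultdict(int), 1: defaultdict(int)}
    totals = {0: 0, 1: 0}
    for tags, y in data:
        totals[y] += 1
        for f in tag_ngrams(tags):
            counts[y][f] += 1
    return counts, totals

def predict_upvote_prob(counts, totals, tags):
    """Posterior P(upvote | tags) under naive Bayes with Laplace smoothing."""
    n = totals[0] + totals[1]
    log_post = {}
    for y in (0, 1):
        lp = math.log((totals[y] + 1) / (n + 2))  # smoothed class prior
        for f in tag_ngrams(tags):
            lp += math.log((counts[y][f] + 1) / (totals[y] + 2))
        log_post[y] = lp
    # normalise the two log-posteriors into an actual probability
    m = max(log_post.values())
    e0, e1 = math.exp(log_post[0] - m), math.exp(log_post[1] - m)
    return e1 / (e0 + e1)
```

Because the pair features are explicit, the model can learn that "java" alone predicts differently from "java+database", which is exactly the non-linear interaction the question asks for.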
You can try several methods (linear regression, SVM, neural networks). The input vector should consist of all possible tags, where each tag represents one dimension.
Then each record in the training set has to be transformed into an input vector according to its tags. For example, say your training set contains different combinations of 4 tags (php, ruby, ms, sql) and you define an unweighted input vector [php, ruby, ms, sql]. Suppose you have the following 3 records, which are transformed to weighted input vectors:
php, sql -> [1, 0, 0, 1]
ruby -> [0, 1, 0, 0]
ms, sql -> [0, 0, 1, 1]
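The transformation above can be sketched in a couple of lines, assuming a fixed tag vocabulary that defines the dimension order:

```python
def vectorize(tags, vocab):
    """One-hot encode a question's tag set over a fixed tag vocabulary."""
    return [1 if t in tags else 0 for t in vocab]

vocab = ["php", "ruby", "ms", "sql"]
vectorize({"php", "sql"}, vocab)  # -> [1, 0, 0, 1]
```

For tens of thousands of tags you would use a sparse representation in practice, but the idea is the same.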
If you use linear regression, you use the formula
y = k * X
where y represents the answer in your case (upvote = 1, no upvote = 0) and X is the weighted input vector; you then solve for the weight vector k from the known training pairs.
How to calculate the weights for linear regression you can read about elsewhere, but the point is to create binary input vectors whose size equals (or exceeds, if you take some other variables into account) the number of all tags, and then for each record set a weight for each tag: 0 if it is not included, 1 otherwise.
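As a minimal sketch of fitting k, here is least-squares via stochastic gradient descent on the toy vectors above (`fit_linear` is a hypothetical helper, not a library function). Note that plain linear regression does not constrain its output to [0, 1]; logistic regression would, which better matches the probability requirement in the question.

```python
def fit_linear(X, y, lr=0.1, epochs=500):
    """Fit weights k for y ~ k . x by least-squares stochastic gradient descent."""
    k = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sum(kj * xj for kj, xj in zip(k, xi))
            err = pred - yi
            # gradient step on the squared error for this sample
            for j, xj in enumerate(xi):
                k[j] -= lr * err * xj
    return k

X = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1]]  # php+sql, ruby, ms+sql
y = [1, 0, 1]                                    # upvote = 1, no upvote = 0
k = fit_linear(X, y)
```

After training, `sum(kj * xj for kj, xj in zip(k, x))` gives the predicted score for a new question's binary tag vector.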