I'm planning to implement a document ranker which uses neural networks. How can one rate a document by taking into consideration the ratings of similar articles? Are there any good Python libraries for doing this? Can anyone recommend a good book on AI, with Python code?
EDIT
I'm planning to make a recommendation engine which would make recommendations based on similar users as well as on data clustered using tags. Users would be given the chance to vote for articles. There will be about a hundred thousand articles. Documents would be clustered based on their tags. Given a keyword, articles would be fetched based on their tags and passed through a neural network for ranking.
The problem you are trying to solve is called "collaborative filtering".
Neural Networks
One state-of-the-art neural network approach uses Deep Belief Networks and Restricted Boltzmann Machines. For a fast Python implementation for a GPU (CUDA) see here. Another option is PyBrain.
Academic papers on your specific problem:
This is probably the state-of-the-art of neural networks and collaborative filtering (of movies):

Salakhutdinov, R., Mnih, A. and Hinton, G. Restricted Boltzmann Machines for Collaborative Filtering, in Proceedings of the 24th International Conference on Machine Learning, 2007. PDF
A Hopfield network implemented in Python:

Huang, Z., Chen, H. and Zeng, D. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22, 1, 116--142, 2004, ACM. PDF
A thesis on collaborative filtering with Restricted Boltzmann Machines (they say Python is not practical for the job):

G. Louppe. Collaborative filtering: Scalable approaches using restricted Boltzmann machines. Master's thesis, University of Liège, 2010. PDF
Neural networks are not currently the state-of-the-art in collaborative filtering, nor are they the simplest, most widespread solutions. Regarding your comment that your reason for using NNs is having too little data: neural networks have no inherent advantage or disadvantage in that case. Therefore, you might want to consider simpler machine learning approaches.
Other Machine Learning Techniques
The best methods today mix k-Nearest Neighbors and Matrix Factorization.
If you are locked into Python, take a look at pysuggest (a Python wrapper for the SUGGEST recommendation engine) and PyRSVD (primarily aimed at applications in collaborative filtering, in particular the Netflix competition).
If you are open to trying other open source technologies, look at: Open Source collaborative filtering frameworks and http://www.infoanarchy.org/en/Collaborative_Filtering.
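To make the matrix-factorization half of that mix concrete, here is a minimal SGD-based factorization sketch in plain NumPy. It is only an illustration: the rating matrix, learning rate, and all names are invented, and real systems would use a tuned library rather than this loop.

```python
import numpy as np

def factorize(ratings, n_factors=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Factor a (user x item) rating matrix into low-rank user and item
    factors via stochastic gradient descent; zeros mean 'not rated'."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    observed = [(u, i, ratings[u, i])
                for u in range(n_users) for i in range(n_items)
                if ratings[u, i] > 0]
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - P[u] @ Q[i]          # prediction error on this rating
            pu = P[u].copy()               # cache before updating
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# 4 users x 3 articles; 0 means "not yet rated".
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])
P, Q = factorize(R)
predictions = P @ Q.T   # dense matrix: fills in the missing cells too
```

The regularization term (`reg`) keeps the factors small so the model doesn't just memorize the few observed ratings; the zero cells of `predictions` positions are the actual recommendations.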
Packages
If you're not committed to neural networks, I've had good luck with SVM, and k-means clustering might also be helpful. Both of these are provided by Milk. It also does Stepwise Discriminant Analysis for feature selection, which will definitely be useful to you if you're trying to find similar documents by topic.
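If you want to see what tag-based clustering amounts to before committing to a library, here is a minimal k-means sketch in plain NumPy. This is not Milk's API, and the tag-count vectors are invented for illustration.

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Naive k-means: assign each row of X to its nearest centroid,
    then recompute centroids, for a fixed number of rounds."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Euclidean distance of every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy tag-count vectors; columns stand for ("python", "ai", "cooking").
docs = np.array([[3., 2., 0.],
                 [4., 1., 0.],
                 [0., 0., 5.],
                 [0., 1., 4.]])
labels, _ = kmeans(docs, k=2)
```

For real documents you would replace the raw counts with something like TF-IDF vectors, but the assignment/update loop stays the same.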
God help you if you choose this route, but the ROOT framework has a powerful machine learning package called TMVA that provides a large number of classification methods, including SVM, NN, and Boosted Decision Trees (also possibly a good option). I haven't used it, but pyROOT provides Python bindings to ROOT functionality. To be fair, when I first used ROOT I had no C++ knowledge and was in over my head conceptually too, so this might actually be amazing for you. ROOT has a HUGE number of data processing tools.
(NB: I've also written a fairly accurate document language identifier using chi-squared feature selection and cosine matching. Obviously your problem is harder, but consider that you might not need very hefty tools for it.)
Storage vs Processing
You mention in your question that there will be about a hundred thousand articles.
Just as another NB, one thing you should know about machine learning is that processes like training and evaluating tend to take a while. You should probably consider ranking all documents for each tag only once (assuming you know all the tags) and storing the results. For machine learning generally, it's much better to use more storage than more processing.
Now to your specific case. You don't say how many tags you have, so let's assume you have 1000, for roundness. If you store the results of your ranking for each doc on each tag, that gives you 100 million floats to store. That's a lot of data, and calculating them all will take a while, but retrieving them is very fast. If instead you recalculate the ranking for each document on demand, you have to do 1000 passes of it, one for each tag. Depending on the kind of operations you're doing and the size of your docs, that could take a few seconds to a few minutes. If the process is simple enough that you can wait for your code to do several of these evaluations on demand without getting bored, then go for it, but you should time this process before making any design decisions / writing code you won't want to use.
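A sketch of the precompute-and-store idea, with a deliberately trivial stand-in for the real ranker (the documents, the `score` rule, and all names here are hypothetical; in practice `score` would be your trained model):

```python
# Hypothetical scoring function -- stands in for whatever the real
# ranker (neural network or otherwise) computes per (doc, tag) pair.
def score(doc, tag):
    return doc["tags"].count(tag) / len(doc["tags"])

docs = {
    "a1": {"tags": ["python", "ai"]},
    "a2": {"tags": ["python", "web", "ai"]},
    "a3": {"tags": ["cooking"]},
}
all_tags = {t for d in docs.values() for t in d["tags"]}

# Precompute ONCE: tag -> list of (doc_id, score), best first.
rankings = {
    tag: sorted(((doc_id, score(d, tag)) for doc_id, d in docs.items()),
                key=lambda pair: pair[1], reverse=True)
    for tag in all_tags
}

# Serving a query is now a dictionary lookup, not a recomputation.
top_for_python = rankings["python"][0][0]
```

At a hundred thousand articles you would persist `rankings` to a database or on-disk store rather than a dict, and rebuild it offline when articles or votes change; the lookup-instead-of-recompute structure is the point.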
Good luck!
If I understand correctly, your task is related to collaborative filtering. There are many possible approaches to this problem; I suggest you follow the Wikipedia page to get an overview of the main approaches you can choose from.
For your project work I can suggest looking at a Python-based intro to neural networks with a simple BackProp NN implementation and a classification example. This is not "the" solution, but perhaps you can build your system out of that example without the need for a bigger framework.
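For a flavour of what such a simple backprop implementation involves, here is a minimal two-layer network in NumPy trained on XOR, the classic sanity check. The architecture, learning rate, and seed are arbitrary choices for illustration; whether it fully solves XOR depends on the random initialization, but the loss should drop.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units, one sigmoid output.
W1 = rng.normal(scale=1.0, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backward pass: chain rule through the sigmoid at each layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ d_out)
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * (X.T @ d_h)
    b1 -= 0.5 * d_h.sum(axis=0)
```

For your ranking use case, `X` would become feature vectors for (article, query) pairs and `y` the vote-derived targets, but the forward/backward structure is identical.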
You might want to check out PyBrain.
The FANN library also looks promising.
I am not really sure that neural networks are the best way to solve this. I think a Euclidean distance score or Pearson correlation score combined with item-based or user-based filtering would be a good start.
An excellent book on the topic is Programming Collective Intelligence by Toby Segaran.
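A small sketch of the Pearson-correlation, user-based approach, in the spirit of that book's examples. The ratings data and all names here are invented for illustration.

```python
def pearson(prefs, a, b):
    """Pearson correlation between users a and b over items both rated."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][it] for it in shared]
    ys = [prefs[b][it] for it in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def recommend(prefs, user):
    """Score unrated items by similarity-weighted votes of other users."""
    totals, sims = {}, {}
    for other in prefs:
        if other == user:
            continue
        s = pearson(prefs, user, other)
        if s <= 0:          # ignore dissimilar users
            continue
        for item, r in prefs[other].items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + s * r
                sims[item] = sims.get(item, 0.0) + s
    return sorted(((totals[it] / sims[it], it) for it in totals),
                  reverse=True)

ratings = {
    "alice": {"a1": 5, "a2": 3, "a3": 4},
    "bob":   {"a1": 4, "a2": 2, "a3": 5},
    "carol": {"a1": 1, "a2": 5},
    "dave":  {"a2": 4, "a3": 5},
}
recs = recommend(ratings, "dave")
```

Swapping `pearson` for a Euclidean-distance score only changes the similarity function; the weighted-vote aggregation stays the same.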