I'm planning to implement a document ranker which uses neural networks. How can one rate a document by taking into consideration the ratings of similar articles? Are there any good Python libraries for doing this? Can anyone recommend a good book on AI, with Python code?
EDIT
I'm planning to make a recommendation engine which would make recommendations based on similar users as well as on data clustered using tags. Users would be given the chance to vote for articles. There will be about a hundred thousand articles. Documents would be clustered based on their tags. Given a keyword, articles would be fetched based on their tags and passed through a neural network for ranking.
The problem you are trying to solve is called "collaborative filtering".
Neural Networks
One state-of-the-art neural network approach uses Deep Belief Networks and Restricted Boltzmann Machines. For a fast Python implementation for a GPU (CUDA) see here. Another option is PyBrain.
Academic papers on your specific problem:
This is probably the state-of-the-art of neural networks and collaborative filtering (of movies):

Salakhutdinov, R., Mnih, A. and Hinton, G. Restricted Boltzmann Machines for Collaborative Filtering, in Proceedings of the 24th International Conference on Machine Learning, 2007. PDF
A Hopfield network implemented in Python:

Huang, Z., Chen, H. and Zeng, D. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22, 1, 116--142, 2004, ACM. PDF
A thesis on collaborative filtering with Restricted Boltzmann Machines (they say Python is not practical for the job):

G. Louppe. Collaborative filtering: Scalable approaches using restricted Boltzmann machines. Master's thesis, University of Liège, 2010. PDF
Neural networks are not currently the state-of-the-art in collaborative filtering, nor are they the simplest, most widespread solutions. Regarding your comment that your reason for using NNs is having too little data: neural networks have no inherent advantage or disadvantage in that case. Therefore, you might want to consider simpler machine learning approaches.
Other Machine Learning Techniques
The best methods today mix k-Nearest Neighbors and Matrix Factorization.
If you are locked into Python, take a look at pysuggest (a Python wrapper for the SUGGEST recommendation engine) and PyRSVD (primarily aimed at applications in collaborative filtering, in particular the Netflix competition).
If you are open to trying other open source technologies, look at: Open Source collaborative filtering frameworks and http://www.infoanarchy.org/en/Collaborative_Filtering.
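To make the matrix-factorization half of that mix concrete, here is a minimal SGD-based factorization sketch in plain NumPy. It is only an illustration: the rating matrix, learning rate, and all names are invented, and real systems would use a tuned library rather than this loop.

```python
import numpy as np

def factorize(ratings, n_factors=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Factor a (user x item) rating matrix into low-rank user and item
    factors via stochastic gradient descent; zeros mean 'not rated'."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    observed = [(u, i, ratings[u, i])
                for u in range(n_users) for i in range(n_items)
                if ratings[u, i] > 0]
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - P[u] @ Q[i]          # prediction error on this rating
            pu = P[u].copy()               # cache before updating
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# 4 users x 3 articles; 0 means "not yet rated".
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])
P, Q = factorize(R)
predictions = P @ Q.T   # dense matrix: fills in the missing cells too
```

The regularization term (`reg`) keeps the factors small so the model doesn't just memorize the few observed ratings; the zero cells of `predictions` positions are the actual recommendations.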
Packages
If you're not committed to neural networks, I've had good luck with SVM, and k-means clustering might also be helpful. Both of these are provided by Milk. It also does Stepwise Discriminant Analysis for feature selection, which will definitely be useful to you if you're trying to find similar documents by topic.
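If you want to see what tag-based clustering amounts to before committing to a library, here is a minimal k-means sketch in plain NumPy. This is not Milk's API, and the tag-count vectors are invented for illustration.

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Naive k-means: assign each row of X to its nearest centroid,
    then recompute centroids, for a fixed number of rounds."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Euclidean distance of every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy tag-count vectors; columns stand for ("python", "ai", "cooking").
docs = np.array([[3., 2., 0.],
                 [4., 1., 0.],
                 [0., 0., 5.],
                 [0., 1., 4.]])
labels, _ = kmeans(docs, k=2)
```

For real documents you would replace the raw counts with something like TF-IDF vectors, but the assignment/update loop stays the same.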
God help you if you choose this route, but the ROOT framework has a powerful machine learning package called TMVA that provides a large number of classification methods, including SVM, NN, and Boosted Decision Trees (also possibly a good option). I haven't used it, but pyROOT provides Python bindings to ROOT functionality. To be fair, when I first used ROOT I had no C++ knowledge and was in over my head conceptually too, so this might actually be amazing for you. ROOT has a HUGE number of data processing tools.
(NB: I've also written a fairly accurate document language identifier using chi-squared feature selection and cosine matching. Obviously your problem is harder, but consider that you might not need very hefty tools for it.)
Storage vs Processing
You mention in your question that there will be about a hundred thousand articles.
Just as another NB, one thing you should know about machine learning is that processes like training and evaluating tend to take a while. You should probably consider ranking all documents for each tag only once (assuming you know all the tags) and storing the results. For machine learning generally, it's much better to use more storage than more processing.
Now to your specific case. You don't say how many tags you have, so let's assume you have 1000, for roundness. If you store the results of your ranking for each doc on each tag, that gives you 100 million floats to store. That's a lot of data, and calculating them all will take a while, but retrieving them is very fast. If instead you recalculate the ranking for each document on demand, you have to do 1000 passes of it, one for each tag. Depending on the kind of operations you're doing and the size of your docs, that could take a few seconds to a few minutes. If the process is simple enough that you can wait for your code to do several of these evaluations on demand without getting bored, then go for it, but you should time this process before making any design decisions / writing code you won't want to use.
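A sketch of the precompute-and-store idea, with a deliberately trivial stand-in for the real ranker (the documents, the `score` rule, and all names here are hypothetical; in practice `score` would be your trained model):

```python
# Hypothetical scoring function -- stands in for whatever the real
# ranker (neural network or otherwise) computes per (doc, tag) pair.
def score(doc, tag):
    return doc["tags"].count(tag) / len(doc["tags"])

docs = {
    "a1": {"tags": ["python", "ai"]},
    "a2": {"tags": ["python", "web", "ai"]},
    "a3": {"tags": ["cooking"]},
}
all_tags = {t for d in docs.values() for t in d["tags"]}

# Precompute ONCE: tag -> list of (doc_id, score), best first.
rankings = {
    tag: sorted(((doc_id, score(d, tag)) for doc_id, d in docs.items()),
                key=lambda pair: pair[1], reverse=True)
    for tag in all_tags
}

# Serving a query is now a dictionary lookup, not a recomputation.
top_for_python = rankings["python"][0][0]
```

At a hundred thousand articles you would persist `rankings` to a database or on-disk store rather than a dict, and rebuild it offline when articles or votes change; the lookup-instead-of-recompute structure is the point.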
Good luck!
If I understand correctly, your task is related to collaborative filtering. There are many possible approaches to this problem; I suggest you follow the Wikipedia page to get an overview of the main approaches you can choose from.
For your project work I can suggest looking at a Python-based intro to neural networks with a simple BackProp NN implementation and a classification example. This is not "the" solution, but perhaps you can build your system out of that example without the need for a bigger framework.
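For a flavour of what such a simple backprop implementation involves, here is a minimal two-layer network in NumPy trained on XOR, the classic sanity check. The architecture, learning rate, and seed are arbitrary choices for illustration; whether it fully solves XOR depends on the random initialization, but the loss should drop.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units, one sigmoid output.
W1 = rng.normal(scale=1.0, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backward pass: chain rule through the sigmoid at each layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ d_out)
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * (X.T @ d_h)
    b1 -= 0.5 * d_h.sum(axis=0)
```

For your ranking use case, `X` would become feature vectors for (article, query) pairs and `y` the vote-derived targets, but the forward/backward structure is identical.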
You might want to check out PyBrain.
The FANN library also looks promising.
I am not really sure that neural networks are the best way to solve this. I think a Euclidean distance score or Pearson correlation score combined with item-based or user-based filtering would be a good start.
An excellent book on the topic is Programming Collective Intelligence by Toby Segaran.
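A small sketch of the Pearson-correlation, user-based approach, in the spirit of that book's examples. The ratings data and all names here are invented for illustration.

```python
def pearson(prefs, a, b):
    """Pearson correlation between users a and b over items both rated."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][it] for it in shared]
    ys = [prefs[b][it] for it in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def recommend(prefs, user):
    """Score unrated items by similarity-weighted votes of other users."""
    totals, sims = {}, {}
    for other in prefs:
        if other == user:
            continue
        s = pearson(prefs, user, other)
        if s <= 0:          # ignore dissimilar users
            continue
        for item, r in prefs[other].items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + s * r
                sims[item] = sims.get(item, 0.0) + s
    return sorted(((totals[it] / sims[it], it) for it in totals),
                  reverse=True)

ratings = {
    "alice": {"a1": 5, "a2": 3, "a3": 4},
    "bob":   {"a1": 4, "a2": 2, "a3": 5},
    "carol": {"a1": 1, "a2": 5},
    "dave":  {"a2": 4, "a3": 5},
}
recs = recommend(ratings, "dave")
```

Swapping `pearson` for a Euclidean-distance score only changes the similarity function; the weighted-vote aggregation stays the same.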