Prioritizing texts based on content
If you have a list of texts and a person interested in certain topics, what algorithms deal with choosing the most relevant text for that person?
I believe this is quite a complex topic, and as an answer I expect a few directions for studying various methodologies of text analysis, text statistics, artificial intelligence, etc.
Thank you.
3 Answers
There are quite a few algorithms out there for this task — far too many to mention them all here. First, some starting points:
Topic discovery and recommendation are two quite distinctive tasks, although they often overlap. If you have a stable userbase, you might be able to give very good recommendations without any topic discovery.
Discovering topics and assigning names to them are also two different tasks. It is often easier to tell that text A and text B share a similar topic than to explicitly state what this common topic might be. Giving names to the topics is best done by humans, for example by having them tag the items.
Now to some actual examples.
TF-IDF is often a good starting point, but it also has severe drawbacks. For example, it will not be able to tell that "car" in one text and "truck" in another mean that the two texts probably share a topic.
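To make the TF-IDF baseline concrete, here is a minimal sketch using only the standard library; the tokenization and the toy documents are made up for illustration. Note how the vehicle texts only look similar because of literally shared words — TF-IDF sees no link between "car" and "truck" themselves:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict) for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the car drove down the road".split(),
    "a truck drove down the highway".split(),
    "stock markets fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
# The two vehicle texts overlap only on "drove"/"down"/"the";
# "car" vs "truck" contributes nothing, which is the drawback above.
print(cosine(vecs[0], vecs[1]))  # positive: some shared terms
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms at all
```

A real system would also lowercase, strip stop words, and stem — this sketch skips all of that to stay short.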
http://websom.hut.fi/websom/ A Kohonen map for automatically clustering data. It learns the topics and then organizes the texts by topics.
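The WEBSOM system above is a full-scale implementation; the core Kohonen idea can be sketched in a few lines. This is a toy one-dimensional self-organizing map over hand-made 2-D vectors (in practice the inputs would be document vectors such as TF-IDF), not the WEBSOM code itself:

```python
import math
import random

def train_som(data, n_units, epochs=30, lr0=0.5, radius0=1.0, seed=0):
    """Train a tiny 1-D self-organizing (Kohonen) map on dense vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)          # learning rate decays over time
        radius = radius0 * (1 - frac)  # neighborhood radius shrinks over time
        for x in data:
            # Best matching unit: closest by squared Euclidean distance.
            bmu = min(range(n_units),
                      key=lambda i: sum((u - v) ** 2 for u, v in zip(units[i], x)))
            for i in range(n_units):
                d = abs(i - bmu)  # grid distance on the 1-D map
                if d <= radius:
                    h = math.exp(-d * d / (2 * (radius + 1e-9) ** 2))
                    units[i] = [u + lr * h * (v - u) for u, v in zip(units[i], x)]
    return units

def assign(units, x):
    return min(range(len(units)),
               key=lambda i: sum((u - v) ** 2 for u, v in zip(units[i], x)))

# Two obvious "topics": vectors near (1, 0) and vectors near (0, 1).
data = [[1, 0.1], [0.9, 0], [1.1, 0.2], [0, 1], [0.1, 0.9], [0.2, 1.1]]
som = train_som(data, n_units=2)
clusters = [assign(som, x) for x in data]
print(clusters)  # the two groups land on different units
```

After training, each map unit acts as a discovered "topic", and texts are organized by which unit they fall on — which is exactly the discover-then-organize behavior described above.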
http://de.wikipedia.org/wiki/Latent_Semantic_Analysis Will be able to boost TF-IDF by detecting semantic similarity among different words. Also note that this has been patented, so you might not be able to use it.
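The mechanism behind LSA is a truncated SVD of the term-document matrix. Here is a hedged sketch, assuming NumPy is available and using a made-up toy matrix: "car" and "truck" never co-occur, but both co-occur with "engine", so the low-rank factorization places them in the same latent topic — the boost over plain TF-IDF mentioned above:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
terms = ["car", "truck", "engine", "road", "stock", "market"]
A = np.array([
    [1, 0, 1, 0, 0],  # car
    [0, 1, 1, 0, 0],  # truck
    [1, 1, 2, 0, 0],  # engine
    [1, 1, 0, 0, 0],  # road
    [0, 0, 0, 1, 1],  # stock
    [0, 0, 0, 2, 1],  # market
], dtype=float)

# Truncated SVD: keep only the k strongest latent "topics".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # each row: one term in latent topic space

def sim(i, j):
    a, b = term_vecs[i], term_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

car, truck, stock = terms.index("car"), terms.index("truck"), terms.index("stock")
print(sim(car, truck))  # high: same latent topic despite never co-occurring
print(sim(car, stock))  # near zero: different topic
```

In a real pipeline you would apply the SVD to a TF-IDF-weighted matrix rather than raw counts, and keep a few hundred dimensions rather than two.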
Once you have a set of topics assigned by users or experts, you can also try almost any kind of machine learning method (for example SVM) to map the TF-IDF data to topics.
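As a dependency-free illustration of that last step — mapping TF-IDF-style vectors to human-assigned topics — here is a nearest-centroid classifier instead of the SVM the answer mentions (an SVM would need a library such as scikit-learn; the training texts and topic labels are made up):

```python
import math
from collections import Counter, defaultdict

def vectorize(doc):
    """Raw term-frequency vector; stands in for a real TF-IDF pipeline."""
    return Counter(doc.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(labeled_docs):
    """Average the vectors of each topic's documents into one centroid."""
    sums = defaultdict(Counter)
    counts = Counter()
    for doc, topic in labeled_docs:
        sums[topic].update(vectorize(doc))
        counts[topic] += 1
    return {t: Counter({w: c / counts[t] for w, c in vec.items()})
            for t, vec in sums.items()}

def predict(centroids, doc):
    """Assign the topic whose centroid is closest in cosine similarity."""
    v = vectorize(doc)
    return max(centroids, key=lambda t: cosine(centroids[t], v))

# Topics assigned by users or experts, as the answer suggests.
labeled = [
    ("the car engine roared down the road", "vehicles"),
    ("a truck hauled cargo on the highway", "vehicles"),
    ("stock prices rose on the market today", "finance"),
    ("investors watched the market fall", "finance"),
]
centroids = train_centroids(labeled)
print(predict(centroids, "the engine of the truck stalled on the road"))
```

The point is the shape of the pipeline — labeled texts in, topic assignments out — which stays the same if you swap the centroid step for an SVM or any other supervised learner.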
As a search engine engineer, I think this problem is best solved using two techniques in conjunction.
Technique 1: Search (TF-IDF or other algorithms)
Use search to create a baseline relevance model for content where you don't have user statistics. There are a number of technologies out there, but I think the Apache Lucene/Solr code base is by far the most mature and stable.
Technique 2: User-based recommenders (k-nearest neighbors or other algorithms)
When you start getting user statistics, use them to enhance the relevance model used by the text analysis system. A fast-growing codebase for solving these kinds of problems is the Apache Mahout project.
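The user-based k-nearest-neighbors step can be sketched briefly. This is a toy in-memory version of what Mahout's user-based recommenders do, with invented users, items, and ratings; a real system would work over millions of rows, not a dict:

```python
import math

# Toy ratings: user -> {item: rating}. All names are made up.
ratings = {
    "alice": {"ml-intro": 5, "search-101": 4, "lucene-guide": 4},
    "bob":   {"ml-intro": 5, "search-101": 5, "mahout-howto": 4},
    "carol": {"cooking-basics": 5, "baking-bread": 4},
}

def user_sim(a, b):
    """Cosine similarity over the items two users have both rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def recommend(user, k=2):
    """Score unseen items by similarity-weighted ratings of the k nearest users."""
    me = ratings[user]
    neighbors = sorted(
        ((user_sim(me, r), name) for name, r in ratings.items() if name != user),
        reverse=True)[:k]
    scores = {}
    for sim, name in neighbors:
        if sim <= 0:
            continue  # ignore users with no taste overlap
        for item, rating in ratings[name].items():
            if item not in me:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # bob's tastes overlap with alice's, carol's don't
```

Blending this score with the search-based baseline from technique 1 gives new users sensible results before any statistics exist, and personalizes as statistics accumulate.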
Check out Programming Collective Intelligence, a really good overview of various techniques along these lines. Also very readable.