Search ranking/relevance algorithms
When developing a database of articles in a Knowledge Base (for example), what are the best ways to sort and display the most relevant answers to a user's question?
Would you use additional data, such as keyword weighting based on whether previous users found the article helpful, or do you find a simple keyword matching algorithm to be sufficient?
Perhaps the easiest and most naive approach that gives immediately useful results is to implement tf-idf.
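To make that concrete, here is a minimal sketch of tf-idf ranking using only the Python standard library. The function names (`build_index`, `search`), the whitespace tokenizer, and the sample articles are all assumptions made for illustration; a real knowledge base would want better tokenization, stemming, and probably an existing library.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    """Return (idf, vectors): per-term idf and one tf-idf vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return idf, vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, docs, idf, vectors):
    """Rank documents by cosine similarity to the query; drop zero scores."""
    tf = Counter(tokenize(query))
    total = sum(tf.values())
    # Query terms never seen in the collection are ignored.
    q_vec = {t: (c / total) * idf[t] for t, c in tf.items() if t in idf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(docs[i], score) for score, i in scored if score > 0]

# Hypothetical knowledge base articles.
articles = [
    "how to reset your password",
    "how to change your email address",
    "billing and invoices explained",
]
idf, vectors = build_index(articles)
for text, score in search("reset password", articles, idf, vectors):
    print(f"{score:.3f}  {text}")
```

Terms that appear in every article get an idf of zero here, so they contribute nothing to the ranking, which is the main advantage over plain keyword counting.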
In a recent related question of mine here I learned of an excellent free book on this topic which you can download or read online:
That's a hard question, and companies like Google are putting a lot of effort into addressing it. Have a look at Google Enterprise Search Appliance or Exalead Enterprise Search.
Then, as a personal opinion, I don't think that any "naive" approach is going to improve the results much compared to naive keyword search combined with ordering by the number of views on the documents.
If you have the possibility to expose your knowledge base to the web, then just do it, and let your favorite search engine handle the search for you.
I think the angle here is not the retrieval itself... it's about scoring the relevance of the retrieved information (a more reactive and passive approach), which can later be used to improve the search engine.
I guess you can try -
kNN on tf-idf for retrieving information
Hand-tagging the retrieved information with a relevance score
Just a thought...
The third point is actually based on the Rocchio algorithm. You can see it here
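For readers unfamiliar with it, the Rocchio algorithm mentioned above moves the query vector toward documents judged relevant and away from those judged non-relevant. The sketch below shows the standard update rule on sparse tf-idf vectors stored as dicts. The function name, the example vectors, and the weights (alpha=1.0, beta=0.75, gamma=0.15, which are commonly quoted textbook defaults) are assumptions, not something specified in this answer.

```python
def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector:
    alpha * query + beta * mean(relevant) - gamma * mean(non_relevant).
    Vectors are dicts mapping term -> weight."""
    updated = {t: alpha * w for t, w in query_vec.items()}
    for docs, weight in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # no judgments of this kind, nothing to add
        for doc in docs:
            for term, w in doc.items():
                updated[term] = updated.get(term, 0.0) + weight * w / len(docs)
    # Negative weights are usually clipped to zero, so drop those terms.
    return {t: w for t, w in updated.items() if w > 0}

# Hypothetical hand-tagged feedback for the query "password".
query = {"password": 1.0}
tagged_relevant = [{"password": 0.5, "reset": 0.5}]
tagged_non_relevant = [{"billing": 1.0}]
print(rocchio(query, tagged_relevant, tagged_non_relevant))
```

The updated vector can then be used for the next retrieval pass, so the hand-tagged scores feed back into the kNN search.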
A little more specificity about your exact problem would be good. There are a lot of different techniques that you can use, and many of them are driven by other pieces of data. You can of course use Lucene and build your own indexes; there are Lucene bindings for many languages. Moving up a level, there is also the Solr project, which is Lucene with a lot of tools and extra functionality around it. That may be more along the lines of what you are looking for.
Intent is tricky, and most modern search engines rely on statistical intent to aid in the ordering of results. You can always have an "Is this article useful?" button and store the query text that leads to useful documents. You could then add a layer of information to the index to boost specific words or phrases and help them point to certain documents.
Some things to think about... How many documents? What is the average length? Are they updated frequently? What do users do with the documents? What does the spread of unique words across documents look like? (More simply: is it easy to match a query with a specific document or documents based on common unique features?)
If it is on the web, you can always make a Google Custom Search Engine that just searches your site, although you may find this to be sub-optimal for a variety of reasons.
You can always start with a simple index and gradually make it more sophisticated by talking with users and capturing data.
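One way the "useful" button idea could be sketched: record which query terms led to an article being marked helpful, then multiply that article's base score when the same terms show up again. Everything here (the `FeedbackBooster` class, the `boost_per_vote` parameter, the article ids) is hypothetical and only illustrates the idea; engines such as Lucene or Solr have their own boosting mechanisms.

```python
from collections import defaultdict

class FeedbackBooster:
    """Boost search scores using 'was this article useful?' votes."""

    def __init__(self, boost_per_vote=0.1):
        self.boost_per_vote = boost_per_vote
        self.votes = defaultdict(int)  # (term, doc_id) -> helpful votes

    def record_helpful(self, query, doc_id):
        # Credit every distinct query term that led to the helpful article.
        for term in set(query.lower().split()):
            self.votes[(term, doc_id)] += 1

    def rerank(self, query, scored_docs):
        """scored_docs is a list of (doc_id, base_score) from any ranker."""
        terms = set(query.lower().split())
        boosted = []
        for doc_id, score in scored_docs:
            votes = sum(self.votes.get((t, doc_id), 0) for t in terms)
            boosted.append((doc_id, score * (1 + self.boost_per_vote * votes)))
        boosted.sort(key=lambda pair: pair[1], reverse=True)
        return boosted

# Hypothetical usage: one user found article "kb-2" useful for this query.
booster = FeedbackBooster(boost_per_vote=0.5)
booster.record_helpful("reset password", "kb-2")
print(booster.rerank("password help", [("kb-1", 1.0), ("kb-2", 0.8)]))
```

A multiplicative boost keeps the base ranker in charge: an article with a base score of zero stays at zero no matter how many votes it has collected.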
Keyword matching is not enough when dealing with questions; you need to understand intent, which, as joannes says, is a very hot topic in search.