Extending/changing how Zend_Search_Lucene searches
I am currently using Zend_Search_Lucene to index and search a set of documents, currently around 1,000. What I would like to do is change how the engine scores hits on a document, away from the current default.
Zend_Search_Lucene scores on the frequency of hits within a document, so a document that has 10 matches of the word PHP will score higher than a document with only 3 matches of PHP. What I am trying to do is pass a number of keywords and score according to how many of those keywords are hit. For example:
I pass 5 keywords, say PHP, MySQL, Javascript, HTML and CSS, and search the index for them. One document matches 3 of those keywords and another matches 4; the document with 4 matches should score highest. The number of occurrences of each word in the document does not concern me.
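To make the desired ranking concrete, here is a rough sketch in plain PHP with no Lucene involved. The function name `countDistinctKeywordMatches` and the two sample texts are hypothetical, and the substring check is only a stand-in for real term matching.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene: counts how many of the
// keywords occur at least once in the text, ignoring how often each occurs.
function countDistinctKeywordMatches($text, array $keywords)
{
    $matched = 0;
    foreach ($keywords as $keyword) {
        // stripos() is a case-insensitive substring search; any number of
        // occurrences of a keyword adds exactly one to the count.
        if (stripos($text, $keyword) !== false) {
            $matched++;
        }
    }
    return $matched;
}

$keywords = array('PHP', 'MySQL', 'Javascript', 'HTML', 'CSS');

// Made-up sample documents for illustration.
$docA = 'PHP PHP PHP PHP with MySQL and HTML'; // repeats PHP, 3 distinct keywords
$docB = 'PHP, MySQL, HTML and CSS, once each'; // no repeats, 4 distinct keywords

echo countDistinctKeywordMatches($docA, $keywords), "\n"; // 3
echo countDistinctKeywordMatches($docB, $keywords), "\n"; // 4
```

Under this scoring the second document ranks first even though the first one mentions PHP four times.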
Now I've had a quick look at Zend_Search_Lucene_Search_Similarity, but I have to confess that I am not sure (or bright enough) how to use it to achieve what I am after.
Is what I want to do possible using Lucene, or is there a better solution out there?
Comments (1)
From what I understand of the Zend_Search_Lucene_Search_Similarity section of the manual, I'd start by extending the default similarity class and overriding the tf (term frequency) method so that it doesn't alter the score:
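A minimal sketch of such a subclass, assuming Zend Framework 1 is on the include path. The class name `MySimilarity` is hypothetical, and the note about the default `tf()` behaviour reflects my reading of the library rather than anything stated in the question.

```php
<?php
// Assumes Zend Framework 1 is on the include path.
require_once 'Zend/Search/Lucene/Search/Similarity/Default.php';

// "MySimilarity" is a hypothetical name chosen for this sketch.
class MySimilarity extends Zend_Search_Lucene_Search_Similarity_Default
{
    /**
     * Term-frequency factor. The default implementation grows with $freq
     * (as far as I can tell it returns sqrt($freq)); returning a constant
     * means repeated occurrences of a term no longer raise the score.
     *
     * @param float $freq number of times the term occurs in the document
     * @return float
     */
    public function tf($freq)
    {
        return 1.0;
    }
}
```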
This way the number of matches shouldn't be taken into account. Do you think this would be enough?
Then, set it to be the default similarity algorithm before indexing:
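Something along these lines, again only a sketch: `MySimilarity.php` is a hypothetical file holding a subclass of the default similarity class whose `tf()` returns a constant, and the index path is a placeholder.

```php
<?php
// Assumes Zend Framework 1 is on the include path.
require_once 'Zend/Search/Lucene.php';
// Hypothetical file containing the similarity subclass that overrides tf().
require_once 'MySimilarity.php';

// Register the custom similarity as the library-wide default
// before any documents are added to the index.
Zend_Search_Lucene_Search_Similarity::setDefault(new MySimilarity());

// Placeholder path; create (or open) the index and add documents as usual.
$index = Zend_Search_Lucene::create('/path/to/index');
```

As far as I can tell, `tf()` is evaluated when a query is scored, so the same default probably needs to be set in the script that runs the searches as well.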