Zend_Search_Lucene: changing how term frequency is scored
I am trying to change how term searches are scored against the documents in my Lucene index. Currently a search scores on the number of times the term appears in a document. I would like the score to depend only on whether the term is present, not on how often it occurs, so that a document containing the term once scores the same as a document containing it 100 times.
I've tried extending Zend_Search_Lucene_Search_Similarity with my own class, but to be honest I am not sure it is working correctly, as the scores are still quite low.
class MySimilarity extends Zend_Search_Lucene_Search_Similarity {
    // override the default frequency of searching
    public function tf($freq) {
        return 1.0;
    }

    public function lengthNorm($fieldName, $numTerms) {
        return 1.0 / sqrt($numTerms);
    }

    public function queryNorm($sumOfSquaredWeights) {
        return 1.0 / sqrt($sumOfSquaredWeights);
    }

    public function sloppyFreq($distance) {
        return 1.0;
    }

    public function idfFreq($docFreq, $numDocs) {
        return log($numDocs / (float) ($docFreq + 1)) + 1.0;
    }

    public function coord($overlap, $maxOverlap) {
        return $overlap / (float) $maxOverlap;
    }
}
This is built from examples I found on Google. The only real change I've made is to the tf() function.
I would be really grateful for any help with this, as at the moment it's really messing up my searches.
Thanks,
Grant
I would try two things to debug this:
Overall, changing the tf() function seems the right thing to do, provided that you only want a relative ordering and do not care about the absolute scores.
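
To see why a flat tf() affects the ordering but not necessarily the size of the scores, here is a minimal standalone sketch. It assumes a simplified classic Lucene-style score for a single term (tf * idf^2 * lengthNorm) and leaves out boosts, coord() and queryNorm(). The function names, the termScore() helper and the sample numbers are all made up for illustration; this is not Zend's internal code.

```php
<?php
// Simplified, hypothetical model of a Lucene-style single-term score.
// Not Zend's internals: boosts, coord() and queryNorm() are left out.

// Default-style term frequency factor: grows with the square root of the count.
function tfDefault(int $freq): float
{
    return sqrt($freq);
}

// Flat term frequency factor: presence counts once, however often the term occurs.
// (The question's tf() returns 1.0 unconditionally; the zero guard is an addition.)
function tfFlat(int $freq): float
{
    return $freq > 0 ? 1.0 : 0.0;
}

// Same formulas as lengthNorm() and idfFreq() in the question's class.
function lengthNorm(int $numTerms): float
{
    return 1.0 / sqrt($numTerms);
}

function idfFreq(int $docFreq, int $numDocs): float
{
    return log($numDocs / (float) ($docFreq + 1)) + 1.0;
}

// Score of one term in one document under the simplified model.
function termScore(callable $tf, int $freq, int $numTerms, int $docFreq, int $numDocs): float
{
    $idf = idfFreq($docFreq, $numDocs);
    return $tf($freq) * $idf * $idf * lengthNorm($numTerms);
}

// Two made-up documents of 400 terms each; the term occurs once in A, 100 times in B.
$flatA    = termScore('tfFlat', 1, 400, 10, 1000);
$flatB    = termScore('tfFlat', 100, 400, 10, 1000);
$defaultA = termScore('tfDefault', 1, 400, 10, 1000);
$defaultB = termScore('tfDefault', 100, 400, 10, 1000);

printf("flat tf:    A=%.4f B=%.4f\n", $flatA, $flatB);
printf("default tf: A=%.4f B=%.4f\n", $defaultA, $defaultB);
```

Under this model the two documents tie once tf is flat, while the length norm and idf factors still determine how large the number is, so a low absolute score does not by itself mean the override is being ignored. With the real library, the custom class also has to be registered before searching, for example with Zend_Search_Lucene_Search_Similarity::setDefault(new MySimilarity()).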