Zend_Search_Lucene: problem changing term frequency scoring

Posted 2024-09-10 10:34:55, 962 characters, 9 views, 0 comments

I am trying to change how term searches are scored in my Lucene index. Currently, a search scores each document by the number of times the term appears in it. I would like to score on whether the term is present at all, not on how often it occurs, so that a document containing the term once scores the same as a document containing it 100 times.

I've tried extending Zend_Search_Lucene_Search_Similarity with my own class, but to be honest I am not sure it is working correctly, as the scores are still quite low.

class MySimilarity extends Zend_Search_Lucene_Search_Similarity{

// Override term frequency: a match counts once, however often the term occurs
public function tf($freq){
    return 1.0; 
}

public function lengthNorm($fieldName, $numTerms) {
    return 1.0/sqrt($numTerms);
}

public function queryNorm($sumOfSquaredWeights) {
    return 1.0/sqrt($sumOfSquaredWeights);
}

public function sloppyFreq($distance) {
    return 1.0;
}

public function idfFreq($docFreq, $numDocs) {
    return log($numDocs/(float)($docFreq+1)) + 1.0;
}

public function coord($overlap, $maxOverlap) {
    return $overlap/(float)$maxOverlap;
}
}

This is built from examples I found by searching good old Google. The only real change I've made is to the tf() function.

I would be really grateful for any help with this, as at the moment it's really messing up my searches.

Thanks,

Grant

Comments (1)

萌能量女王 2024-09-17 10:34:56

I would try two things to debug this:

  1. Build a really small index - two documents, a single field in each, the first having the word "boat", and the second the phrase "boat boat". Test your search on that.
  2. Try to override only the tf() function. This is the change you want. Overriding other parts, such as the norm, requires reindexing using the new similarity function. Make sure you actually need this before reindexing.

Overall, changing the tf() function seems the right thing to do, provided that you only want a relative ordering and do not care about the absolute scores.
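
To see what the two-document test might show, here is a minimal sketch in plain PHP that does not use Zend at all. It assumes a simplified single-term version of the classic Lucene formula (score = tf × idf² × queryNorm × lengthNorm, with all boosts and coord equal to 1) and reuses the lengthNorm, queryNorm and idfFreq formulas from the question. The function names, the `$scores` array and the two tf closures are hypothetical names for illustration; the real library also stores norms in a lossy encoded form, which this sketch ignores.

```php
<?php
// Simplified single-term scoring model. This is an assumption for illustration,
// not Zend_Search_Lucene's actual code path.

// Same formula as idfFreq() in the question.
function idf($docFreq, $numDocs)
{
    return log($numDocs / (float)($docFreq + 1)) + 1.0;
}

// Same formula as lengthNorm() in the question (field name omitted).
function lengthNorm($numTerms)
{
    return 1.0 / sqrt($numTerms);
}

// Score of one document for a one-term query, with boosts and coord fixed at 1.
function score($tf, $freq, $numTerms, $docFreq, $numDocs)
{
    $idf = idf($docFreq, $numDocs);
    $queryNorm = 1.0 / sqrt($idf * $idf); // sumOfSquaredWeights = idf^2 for one term
    return $tf($freq) * $idf * $idf * $queryNorm * lengthNorm($numTerms);
}

// Two candidate tf() implementations: square root of the count, and presence only.
$tfVariants = array(
    'sqrt tf' => function ($freq) { return sqrt($freq); },
    'flat tf' => function ($freq) { return $freq > 0 ? 1.0 : 0.0; },
);

// The toy index from the answer: text => array(term frequency, field length).
$docs = array(
    'boat'      => array(1, 1),
    'boat boat' => array(2, 2),
);

$scores = array();
foreach ($tfVariants as $label => $tf) {
    foreach ($docs as $text => $stats) {
        list($freq, $numTerms) = $stats;
        // Both documents contain the term, so docFreq = numDocs = 2.
        $scores[$label][$text] = score($tf, $freq, $numTerms, 2, 2);
        printf("%-8s %-10s %.4f\n", $label, '"' . $text . '"', $scores[$label][$text]);
    }
}
```

Under these assumptions, the flat tf() removes the frequency effect, but lengthNorm() still divides by the square root of the field length, so the longer "boat boat" document comes out lower than "boat" instead of equal. If identical scores are the goal, lengthNorm() would also have to return a constant, which, as noted above, means reindexing with the new similarity.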
