Lucene.Net: relevance by distance between words
I create (and frequently update) an index of users with the following code (shortened a bit here for demonstration purposes):
Lucene.Net.Store.Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo("TestLuceneIndex"));
StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
IndexWriter indexWriter = new IndexWriter(directory, standardAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.Add(new Field("UID", uid, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
doc.Add(new Field("GENDER", gender, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
doc.Add(new Field("COUNTRY", countrycode, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
doc.Add(new Field("CITY", citycode, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
doc.Add(new Field("USERDATA", userdata, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.Add(new Field("USERINFO", userinfo, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
indexWriter.UpdateDocument(new Term("UID", uid), doc);
indexWriter.Optimize();
indexWriter.Commit();
indexWriter.Close();
The values stored in the index are as follows:
UID - user id (GUID string)
GENDER - gender id (string: "0" (unidentified), "1" (male) or "2" (female))
COUNTRY - country code (string such as "US", "FR", etc.)
CITY - city code (string such as "A121", "C432", etc.)
USERDATA - long string of user details (something like "John Doe [email protected] designer high education 5 years of experience")
USERINFO - long string of text about the user (something like "My name is John Doe. I was born ...")
Then I search the index. I search two fields (USERDATA and USERINFO) and, whenever necessary, filter the results by GENDER, COUNTRY and CITY. From each result I retrieve the UID (I need this value to identify the user's record in the DB).
This is the code I use for searching:
Lucene.Net.Store.Directory directory = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo("TestLuceneIndex"));
standardAnalyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Lucene.Net.Index.IndexReader indexReader = Lucene.Net.Index.IndexReader.Open(directory, true);
indexSearcher = new Lucene.Net.Search.IndexSearcher(indexReader);
Lucene.Net.Search.BooleanQuery booleanQuery = new Lucene.Net.Search.BooleanQuery();
Lucene.Net.QueryParsers.MultiFieldQueryParser queryTextParser = new Lucene.Net.QueryParsers.MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, new string[] { "USERDATA", "USERINFO" }, standardAnalyzer);
Lucene.Net.Search.Query queryText = queryTextParser.Parse(SearchText);
booleanQuery.Add(queryText, Lucene.Net.Search.BooleanClause.Occur.MUST);
if (searchGender != "0")
{
Lucene.Net.Index.Term termGender = new Lucene.Net.Index.Term("GENDER", searchGender);
Lucene.Net.Search.Query queryGender = new Lucene.Net.Search.TermQuery(termGender);
booleanQuery.Add(queryGender, Lucene.Net.Search.BooleanClause.Occur.MUST);
}
if (searchCity != "0")
{
Lucene.Net.Index.Term termCity = new Lucene.Net.Index.Term("CITY", searchCity);
Lucene.Net.Search.Query queryCity = new Lucene.Net.Search.TermQuery(termCity);
booleanQuery.Add(queryCity, Lucene.Net.Search.BooleanClause.Occur.MUST);
}
if (searchCountry != "0")
{
Lucene.Net.Index.Term termCountry = new Lucene.Net.Index.Term("COUNTRY", searchCountry);
Lucene.Net.Search.Query queryCountry = new Lucene.Net.Search.TermQuery(termCountry);
booleanQuery.Add(queryCountry, Lucene.Net.Search.BooleanClause.Occur.MUST);
}
Lucene.Net.Search.TopScoreDocCollector collector = Lucene.Net.Search.TopScoreDocCollector.create(indexReader.MaxDoc(), true);
indexSearcher.Search(booleanQuery, collector);
Lucene.Net.Search.ScoreDoc[] scoreDocs=collector.TopDocs().scoreDocs;
Lucene.Net.Highlight.Formatter formatter = new Lucene.Net.Highlight.SimpleHTMLFormatter("<b>", "</b>");
Lucene.Net.Highlight.QueryScorer queryScorer = new Lucene.Net.Highlight.QueryScorer(booleanQuery);
highlighter = new Lucene.Net.Highlight.Highlighter(formatter, queryScorer);
Lucene.Net.Highlight.Fragmenter fragmenter = new Lucene.Net.Highlight.SimpleFragmenter(150);
highlighter.SetTextFragmenter(fragmenter);
Everything works well enough except the quality of relevance when searching for several words:
When I search, for instance, for (microsoft .net programmer), results containing the exact substring are not scored higher than results containing those words in different places in the text. I understand that this happens because the score is based on how much of the text the search terms make up rather than on how exactly the strings coincide. But how can I force the scoring algorithm to value exactness more? That is, how can I make the distance between the words found count for more in the relevance calculation?
Comments (1)
The most effective (and most labor-intensive) way would be to write your own query object that assigns higher relevance to documents with the words in close proximity. SpanQuery would be a good place to start.
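As a rough illustration of the SpanQuery starting point, here is a minimal sketch that combines a boosted SpanNearQuery with a plain "all words required" query. It is not a custom query object, only a composition of existing ones. The class name ProximityQueries, the method Build and its parameters are hypothetical, the boost value is arbitrary, and the terms are assumed to have already been run through the same analyzer as the index (lowercased, punctuation stripped). The calls follow the Lucene.Net 2.9-style API used in the question.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;

// Hypothetical helper, not part of Lucene.Net.
public static class ProximityQueries
{
    // field:         name of an ANALYZED field, e.g. "USERINFO"
    // analyzedTerms: terms exactly as the analyzer would index them (assumption)
    // slop:          how many positions apart the words may be
    // nearBoost:     arbitrary weight for the proximity clause
    public static Query Build(string field, string[] analyzedTerms, int slop, float nearBoost)
    {
        SpanQuery[] spanClauses = new SpanQuery[analyzedTerms.Length];
        BooleanQuery allWords = new BooleanQuery();

        for (int i = 0; i < analyzedTerms.Length; i++)
        {
            Term term = new Term(field, analyzedTerms[i]);
            spanClauses[i] = new SpanTermQuery(term);
            // Every word must occur somewhere in the field.
            allWords.Add(new TermQuery(term), BooleanClause.Occur.MUST);
        }

        // Matches only when all words fall within `slop` positions of each
        // other, in any order (inOrder = false).
        SpanNearQuery near = new SpanNearQuery(spanClauses, slop, false);
        near.SetBoost(nearBoost);

        // The proximity clause is optional: it does not filter documents out,
        // it only adds score to those where the words are close together.
        BooleanQuery combined = new BooleanQuery();
        combined.Add(allWords, BooleanClause.Occur.MUST);
        combined.Add(near, BooleanClause.Occur.SHOULD);
        return combined;
    }
}
```

Under these assumptions, something like ProximityQueries.Build("USERINFO", new[] { "microsoft", "net", "programmer" }, 10, 4.0f) could be added to the outer BooleanQuery with Occur.MUST in place of the parsed text query, once per searched field.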
The easiest way would be to use a proximity search along with the regular boolean query: ("search text"~10 || (search && text)). This will rank the proximity phrase matches higher. Since you are building your own query, you could even boost "search text"~10 more than "search text"~20, which in turn is boosted higher than (search && text).
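The tiered boosting described above can be sketched as a small helper that turns the user's search text into a query string for the query parser. The class name ProximityQueryText, the method Build and the boost values 4 and 2 are all illustrative assumptions, and the sketch assumes the words contain no query-syntax special characters (quotes, parentheses, colons and so on).

```csharp
using System;

// Hypothetical helper, not part of Lucene.Net: it only builds a query string.
public static class ProximityQueryText
{
    // Returns: "w1 w2"~tight^4 || "w1 w2"~loose^2 || (w1 && w2)
    public static string Build(string searchText, int tightSlop, int looseSlop)
    {
        string[] words = searchText.Split(
            new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        // Proximity is meaningless for zero or one word: return the input trimmed.
        if (words.Length < 2)
        {
            return searchText.Trim();
        }

        string phrase = "\"" + string.Join(" ", words) + "\"";
        string allWords = "(" + string.Join(" && ", words) + ")";

        // Tight proximity gets the highest boost, loose proximity a smaller
        // one, and the plain all-words match keeps the default boost.
        return phrase + "~" + tightSlop + "^4 || "
             + phrase + "~" + looseSlop + "^2 || "
             + allWords;
    }
}
```

For example, ProximityQueryText.Build("search text", 10, 20) produces "search text"~10^4 || "search text"~20^2 || (search && text), which could be passed to queryTextParser.Parse in place of the raw SearchText. The boost values would need tuning against real data.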