Lucene.NET Highlighter plugin highlighting strangely
I'm trying to add the Lucene.NET Highlighter to my search, but it's doing some really strange highlighting. What am I doing wrong?
Here's the highlighting code:
// stuff here to get scoreDocs
var content = doc.GetField("content").StringValue();
// content = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been"
var highlighter = new Highlighter(new StrongFormatter(), new HtmlEncoder(), new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader())));
highlighter.SetTextFragmenter(new SimpleFragmenter(100));
var tokenStream = analyzer.TokenStream("content", new StringReader(content));
var bestFragment = highlighter.GetBestFragment(tokenStream, content);
Searching for "lorem" gives me this bestFragment value:
<strong>Lorem</strong> <strong>Ipsum</strong> is <strong>simply</strong> <strong>dummy</strong> <strong>text</strong> of the <strong>printing</strong> and <strong>typesetting</strong> <strong>industry</strong>. <strong>Lorem</strong> <strong>Ipsum</strong> <strong>has</strong> <strong>been</strong>
As you can see, it has highlighted much more than just "Lorem". Why?
How do I make this behave sensibly?
I'm using a StandardAnalyzer, and my query looks like "content:lorem".
Edit: I'm using Lucene.NET 2.9.2
Comments (1)
You haven't posted your implementation of StrongFormatter or HtmlEncoder, but I would guess that the error is in the first one. A formatter needs to check the score of the passed TokenGroup to decide whether any formatting is needed.
However, you're not the first one who wants to wrap matches in an HTML element. You could just use the SimpleHTMLFormatter that comes with Highlighter.Net. And while you're at it, there's also a SimpleHTMLEncoder, which probably does what your HtmlEncoder does.
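To illustrate the score check, here is a minimal sketch of what a corrected formatter might look like. Since the original StrongFormatter was never posted, this is a hypothetical replacement, not the asker's code. It assumes the 2.9.x contrib Highlighter, where the types are expected to live in the Lucene.Net.Highlight namespace, the interface is named Formatter, and the score is read with GetTotalScore(); later releases rename these, so adjust to your version.

```csharp
using Lucene.Net.Highlight; // assumed namespace for the 2.9.x contrib Highlighter

// Hypothetical replacement for the StrongFormatter from the question.
public class StrongFormatter : Formatter
{
    public string HighlightTerm(string originalText, TokenGroup tokenGroup)
    {
        // The highlighter calls this for every token group, matched or not.
        // A total score of 0 means nothing in the group matched the query,
        // so the text is returned untouched.
        if (tokenGroup.GetTotalScore() <= 0)
        {
            return originalText;
        }

        // Only groups that scored against the query get wrapped.
        return "<strong>" + originalText + "</strong>";
    }
}
```

If a custom class isn't needed, passing new SimpleHTMLFormatter("<strong>", "</strong>") and new SimpleHTMLEncoder() to the Highlighter constructor in place of the custom formatter and encoder should give the same result.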