Lucene.NET Highlighter plugin highlights strangely

Posted on 2024-09-28 00:22:09

I'm trying to add the Lucene.NET Highlighter to my search, but it's doing some really strange highlighting. What am I doing wrong?

Here's the highlighting code:

// stuff here to get scoreDocs

var content = doc.GetField("content").StringValue();
// content = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been"

  
var highlighter = new Highlighter(new StrongFormatter(), new HtmlEncoder(), new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader())));
highlighter.SetTextFragmenter(new SimpleFragmenter(100));
var tokenStream = analyzer.TokenStream("content", new StringReader(content));

var bestFragment = highlighter.GetBestFragment(tokenStream, content);

Searching for "lorem" gives me this bestFragment value:

<strong>Lorem</strong> <strong>Ipsum</strong> is <strong>simply</strong> <strong>dummy</strong> <strong>text</strong> of the <strong>printing</strong> and <strong>typesetting</strong> <strong>industry</strong>. <strong>Lorem</strong> <strong>Ipsum</strong> <strong>has</strong> <strong>been</strong>

As you can see, it has highlighted much more than just "Lorem". Why?

How do I make this behave sensibly?

I'm using a StandardAnalyzer and my query looks like "content:lorem"

Edit: I'm using Lucene.NET 2.9.2

Comments (1)

神也荒唐 2024-10-05 00:22:09

You haven't posted your implementations of StrongFormatter or HtmlEncoder, but I would guess that the bug is in the first one. It needs to check the score of the TokenGroup it is passed to decide whether any formatting is needed.

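public class StrongFormatter : Formatter {
    public String HighlightTerm(String originalText, TokenGroup tokenGroup) {
        // A total score of zero (or less) means no query term matched this token group.
        var score = tokenGroup.GetTotalScore();
        if (score <= 0)
            return originalText;

        // Wrap matching text only; note the closing tag is </strong>.
        return String.Concat("<strong>", originalText, "</strong>");
    }
}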

However, you're not the first person who wants to wrap matches in an HTML element. You could just use the SimpleHTMLFormatter that comes with Highlighter.Net. While you're at it, there's also a SimpleHTMLEncoder, which probably does what your HtmlEncoder does.
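
For illustration, here is a rough sketch of the question's code with the two built-in classes swapped in. The helper class and method names are hypothetical, and the `Lucene.Net.Highlight` namespace is an assumption about the 2.9.x contrib package (later releases use `Lucene.Net.Search.Highlight`), so adjust it to whatever your assembly exposes.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Highlight; // assumption: contrib namespace in 2.9.x; may differ in your build
using Lucene.Net.Search;

public static class HighlightSketch
{
    // Hypothetical helper; the name and parameters are illustrative only.
    public static string BestFragment(Query query, IndexSearcher indexSearcher,
                                      Analyzer analyzer, string content)
    {
        // Built-in formatter: wraps a token in the given tags only when
        // its TokenGroup has a positive score, i.e. it matched the query.
        var formatter = new SimpleHTMLFormatter("<strong>", "</strong>");

        // Same scorer and fragmenter setup as in the question.
        var scorer = new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader()));
        var highlighter = new Highlighter(formatter, new SimpleHTMLEncoder(), scorer);
        highlighter.SetTextFragmenter(new SimpleFragmenter(100));

        var tokenStream = analyzer.TokenStream("content", new StringReader(content));
        return highlighter.GetBestFragment(tokenStream, content);
    }
}
```

With this in place, a query like "content:lorem" should leave the non-matching words unwrapped, since their token groups score zero.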
