Making Lucene index one value and store another

Posted 2024-10-09 14:13:18 · length 456 · 5 views · 0 comments

I want Lucene.NET to store a value while indexing a modified, stripped-down version of that value. For example, consider the value:

this_example-has some/weird (chars) 100%

I want it stored exactly like that (so that I can retrieve it unchanged for display in the results list), but I want Lucene to index it as:

this example has some weird chars 100

(that is, a "sanitized" version of the original value) to simplify searching.

I figure this would be the job of an analyzer, but I don't want to roll my own. Ideally, the solution should remove everything that is not a letter, a number, or a quote, replacing each removed character with a space before indexing.

Any suggestions on how to implement that?

This is because I am indexing products for an e-commerce search, and some of them have really creepy names. I think this would improve search accuracy.

Thanks in advance.

Comments (2)

躲猫猫 2024-10-16 14:13:18

If you don't want a custom analyzer, try storing the original value in a separate, non-indexed field, and use a simple regex to generate the sanitized version that goes into the indexed field.

var input = "this_example-has some/weird (chars) 100%";
var output = Regex.Replace(input, @"[\W_]+", " ");
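
Note that `[\W_]+` also strips quotes, while the question asks to keep them. A minimal sketch of a variant that keeps letters, digits, and both quote characters is shown here; the `NameSanitizer` class and `Sanitize` method are hypothetical names introduced for illustration, and trimming the result is an assumption about what you want at the edges.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameSanitizer
{
    // Matches runs of characters that are NOT a Unicode letter (\p{L}),
    // a Unicode digit (\p{N}), a double quote, or a single quote.
    // In a verbatim string, "" stands for one literal double quote.
    private static readonly Regex Unwanted = new Regex(@"[^\p{L}\p{N}""']+");

    // Replaces each run of unwanted characters with a single space and
    // trims the leading/trailing space that a run at either end leaves behind.
    public static string Sanitize(string value)
    {
        return Unwanted.Replace(value, " ").Trim();
    }
}
```

With this helper, `NameSanitizer.Sanitize("this_example-has some/weird (chars) 100%")` yields `this example has some weird chars 100`, and a quoted measurement such as `6" pipe (PVC)` keeps its quote.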

You mention that you need another analyzer for some search functionality. Don't forget PerFieldAnalyzerWrapper, which lets you use different analyzers for different fields within the same document.

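// Requires: using Lucene.Net.Analysis; using Lucene.Net.Analysis.Standard;
//           using Lucene.Net.Documents; using Lucene.Net.Index; using Lucene.Net.Util;
public static void Main() {
    // StandardAnalyzer handles every field except "id", which KeywordAnalyzer keeps as a single term.
    var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
    wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());

    IndexWriter writer = null; // TODO: Supply your own IndexWriter; this placeholder throws if left null.
    Document document = null;  // TODO: Supply the document to index.
    writer.AddDocument(document, analyzer: wrapper);
}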
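
To make the two-field idea concrete, here is a rough sketch of how such a document might be built. It assumes the Lucene.NET 2.9/3.0-era `Field` constructor with `Field.Store` and `Field.Index` flags (newer Lucene.NET versions use different field classes). The `ProductDocuments` class and the field names `name` and `name_search` are hypothetical, chosen only for this example.

```csharp
using System.Text.RegularExpressions;
using Lucene.Net.Documents;

public static class ProductDocuments
{
    // Hypothetical field names, chosen for this sketch only.
    public const string DisplayField = "name";        // original text, shown in results
    public const string SearchField = "name_search";  // sanitized text, used for queries

    public static Document Build(string id, string productName)
    {
        // Same regex as in the answer: each run of non-word characters
        // and underscores becomes one space; Trim removes edge spaces.
        string sanitized = Regex.Replace(productName, @"[\W_]+", " ").Trim();

        var document = new Document();
        // Analyzed, so a per-field KeywordAnalyzer can keep it as one term.
        document.Add(new Field("id", id, Field.Store.YES, Field.Index.ANALYZED));
        // Stored but not indexed: retrievable for display, never searched.
        document.Add(new Field(DisplayField, productName, Field.Store.YES, Field.Index.NO));
        // Indexed but not stored: searchable, not returned with hits.
        document.Add(new Field(SearchField, sanitized, Field.Store.NO, Field.Index.ANALYZED));
        return document;
    }
}
```

Queries would then target `name_search`, while the results list reads `name` from each hit.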

山有枢 2024-10-16 14:13:18

You are correct that this is the job of the analyzer. Before deciding what to use, I'd start with a tool like Luke to see what the standard analyzer does with your terms; it tends to do a good job of stripping noise characters and words.
