Make Lucene index one value and store another

Posted on 2024-10-09 14:13:18

I want Lucene.NET to store a value while indexing a modified, stripped-down version of it. For example, consider the value:

this_example-has some/weird (chars) 100%

I want it stored exactly like that (so that I can retrieve it unchanged for display in the results list), but I want Lucene to index it as:

this example has some weird chars 100

(that is, a "sanitized" version of the original value) to simplify searching.

I figure this would be the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a digit, or a quote, replacing the removed characters with whitespace before indexing.

Any suggestions on how to implement that?

This is because I am indexing products for an e-commerce search, and some have really creepy names. I think this would improve search accuracy.

Thanks in advance.

躲猫猫 2024-10-16 14:13:18

If you don't want a custom analyzer, try storing the original value in a separate, non-indexed field, and use a simple regex to generate the sanitized version that goes into the indexed field.

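// requires: using System.Text.RegularExpressions;
var input = "this_example-has some/weird (chars) 100%";
// Replace each run of non-word characters and underscores with one space, then trim the ends.
var output = Regex.Replace(input, @"[\W_]+", " ").Trim();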
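
Note that the `[\W_]+` pattern also strips quotation marks, which the question wants to keep. A minimal sketch of a variant that keeps letters, digits, and quotes might look like the following. The `NameSanitizer` class and its `Sanitize` method are hypothetical names used for illustration, not part of Lucene.NET; the result would go into the indexed field while the original string goes into the stored one.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene.NET): builds the text to index.
public static class NameSanitizer
{
    // Matches any run of characters that are not Unicode letters,
    // decimal digits, single quotes, or double quotes.
    private static readonly Regex Unwanted =
        new Regex(@"[^\p{L}\p{Nd}'""]+", RegexOptions.Compiled);

    public static string Sanitize(string value)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;

        // Each unwanted run (including whitespace) collapses to one space;
        // Trim removes the space left by leading or trailing symbols.
        return Unwanted.Replace(value, " ").Trim();
    }
}
```

For the sample value in the question this yields `this example has some weird chars 100`, and a name such as `19" monitor (black)` becomes `19" monitor black`, with the quote kept.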

You mention that you need another analyzer for some search functionality. Don't forget PerFieldAnalyzerWrapper, which lets you use different analyzers for different fields within the same document.

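public static void Main() {
    // StandardAnalyzer is the default; the "id" field is kept as a single token by KeywordAnalyzer.
    var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
    wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());

    IndexWriter writer = null; // TODO: Retrieve these.
    Document document = null;
    // Pass the wrapper so each field is analyzed with the analyzer registered for it.
    writer.AddDocument(document, analyzer: wrapper);
}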