Strange behavior of Lucene's SpanishAnalyzer class with accented words

Posted on 2024-12-18 01:37:40

I'm using the SpanishAnalyzer class in Lucene 3.4. When I parse accented words, I get a strange result. For example, if I parse the two words "comunicación" and "comunicacion", the stems I get are "comun" and "comunicacion". If I instead parse "maratón" and "maraton", I get the same stem for both words ("maraton").

So, at least in my opinion, it's very strange that the same word, "comunicación", gives different results depending on whether it is accented. If I search for the word "comunicacion", I should get the same results regardless of whether it's accented.

This is the code I'm using:

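import org.apache.lucene.analysis.es.SpanishAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

SpanishAnalyzer sa = new SpanishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", sa);
String str = "comunicación";
String str2 = "comunicacion";
// parse() throws ParseException, so the enclosing method must declare or handle it
System.out.println("first: " + parser.parse(str)); // stem = comun
System.out.println("second: " + parser.parse(str2)); // stem = comunicacion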

The workaround I've found for matching every word that shares the stem of "comunicacion", accented or not, is to strip the accents first and then parse the text with the analyzer, but I don't know whether that's the right approach.
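To make that workaround concrete, here is a minimal sketch of the accent-stripping step using only the JDK's java.text.Normalizer. The class name AccentFolding and the method stripAccents are hypothetical names chosen for illustration; they are not part of Lucene. The sketch only covers the pre-processing step, and its result would then be handed to parser.parse(...) as in the code above.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of Lucene.
final class AccentFolding {

    // Decompose each character into base letter + combining marks (NFD),
    // then delete the combining marks (Unicode category M).
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00f3 is "o" with an acute accent
        System.out.println(stripAccents("comunicaci\u00f3n")); // prints "comunicacion"
        System.out.println(stripAccents("marat\u00f3n"));      // prints "maraton"
    }
}
```

Two caveats apply to this sketch. The same folding would have to be applied to the text at index time as well as to the query, otherwise the indexed terms and the query terms will not line up. It also folds "ñ" to "n", which may or may not be acceptable for Spanish text. Lucene ships an ASCIIFoldingFilter that does comparable folding inside an analysis chain, which may be a better fit than pre-processing strings by hand, depending on where it sits relative to the stemmer.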

Can anyone help me, please?
