Hibernate Search ShingleAnalyzerWrapper working example

Posted on 2024-10-21 07:02:36

I am using hibernate-search-3.2.1.Final and would like to parse my input into shingles. From what I can see in the documentation, ShingleAnalyzerWrapper seems to be exactly what I need. I have tested with WhitespaceAnalyzer, StandardAnalyzer, and SnowballAnalyzer as the default analyzer for the ShingleAnalyzerWrapper.

Version luceneVersion = Version.LUCENE_29;
SnowballAnalyzer keywordAnalyzer = new SnowballAnalyzer(luceneVersion, "English", StopAnalyzer.ENGLISH_STOP_WORDS_SET);
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(keywordAnalyzer, 4);
shingleAnalyzer.setOutputUnigrams(false);
QueryParser keywordParser = new QueryParser(luceneVersion, "keyword", keywordAnalyzer);
Query keywordQuery = keywordParser.parse(QueryParser.escape(keyword.toLowerCase()));

However, the query came back empty. I was expecting a keyword like "hello world, this is Lucene" to produce the shingles [hello world this is, world this is lucene, this is lucene].

Let me know whether my expectations and usage of ShingleAnalyzerWrapper are correct.

Thanks,
Ryan

Comments (2)

雨巷深深 2024-10-28 07:02:36

Maybe it's a copy/paste error, but in your code snippet, the shingleAnalyzer is not actually being used because you're passing the variable keywordAnalyzer to the query parser. What analyzer are you using at indexing time?

If you use an analyzer that filters out stop words as the delegate analyzer for ShingleAnalyzerWrapper, stop words ("this" and "is" in your example) will be dropped before the shingle analyzer has a chance to create shingles from them.
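The stop-word effect described above can be illustrated without Lucene. The following is a minimal sketch in plain Java, not Lucene's ShingleFilter: the class name ShingleSketch, its methods, and the three-word stop list are all hypothetical and exist only for this example. It assumes shingles are simply runs of 2 to maxSize adjacent tokens, with no unigrams, and it ignores the position gaps that a real stop filter leaves behind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a hand-rolled shingler, not Lucene's implementation.
class ShingleSketch {

    // Hypothetical stop list for the demo; a real English stop set is larger.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "is", "with"));

    // Lower-case the input and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Drop stop words, as a stop-filtering delegate analyzer would.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Emit every run of 2..maxSize adjacent tokens, joined by a space.
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = 2; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("hello world, this is Lucene");
        // Shingling the raw tokens keeps "this" and "is" inside the shingles.
        System.out.println(shingles(tokens, 4));
        // Shingling after stop-word removal can no longer produce "this is lucene".
        System.out.println(shingles(removeStopWords(tokens), 4));
    }
}
```

With the stop words removed first, only "hello world", "hello world lucene", and "world lucene" remain, so none of the shingles expected in the question can be produced by that analyzer chain.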

A good way to debug analyzers is to use something like AnalyzerUtils described in "Lucene in Action". You can get the sample code here: http://java.codefetch.com/example/in/LuceneInAction/src/lia/analysis/AnalyzerUtils.java

Nikita

胡渣熟男 2024-10-28 07:02:36

Thanks Nikita! Yes, it was a copy-and-paste error, though the correct version still does produce the right results.

Your link to AnalyzerUtils was a great help, as I was able to use the following code to generate shingles:

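import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;

// Emit shingles of up to 4 tokens and suppress single-token output.
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(4);
shingleAnalyzer.setOutputUnigrams(false);

TokenStream stream = shingleAnalyzer.tokenStream("contents", new StringReader("red dress shoes with black laces"));
List<Token> tokenList = new ArrayList<Token>();
while (true) {
    Token token = null;
    try {
        token = stream.next();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // A null token means the stream is exhausted (or the read above failed).
    if (token == null) break;
    tokenList.add(token);
}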

Which produces:

[(red dress,0,9,type=shingle), (red dress shoes,0,15,type=shingle,posIncr=0), (red dress shoes black,0,26,type=shingle,posIncr=0), (dress shoes,4,15,type=shingle), (dress shoes black,4,26,type=shingle,posIncr=0), (dress shoes black laces,4,32,type=shingle,posIncr=0), (shoes black,10,26,type=shingle), (shoes black laces,10,32,type=shingle,posIncr=0), (black laces,21,32,type=shingle)]

The problem was not with the ShingleAnalyzerWrapper itself but with the QueryParser. I will need to dig further to find the underlying cause, but you gave me somewhere to start.
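One possible explanation, which this thread does not confirm: the classic Lucene QueryParser treats whitespace as a separator between terms, so each word may reach the analyzer on its own. A single token cannot form a shingle, and with unigram output disabled nothing is left. The sketch below only simulates that idea in plain Java. The class PerTermShingleSketch and its methods are hypothetical names for this illustration, and the shingling is the same simplified "runs of adjacent tokens" logic, not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: contrasts whole-string analysis with per-term analysis.
class PerTermShingleSketch {

    // Split on whitespace and lower-case; empty chunks are skipped.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.trim().toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Emit every run of 2..maxSize adjacent tokens (no unigrams).
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = 2; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    // The analyzer sees the full input in one pass.
    static List<String> analyzeWhole(String text, int maxSize) {
        return shingles(words(text), maxSize);
    }

    // The analyzer is invoked once per whitespace-separated chunk,
    // which is how a parser that pre-splits its input would behave.
    static List<String> analyzePerTerm(String text, int maxSize) {
        List<String> out = new ArrayList<>();
        for (String chunk : words(text)) {
            out.addAll(shingles(words(chunk), maxSize));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyzeWhole("red dress shoes", 4));   // three shingles
        System.out.println(analyzePerTerm("red dress shoes", 4)); // empty list
    }
}
```

If this is what happens, calling the analyzer directly (as in the tokenStream snippet above) yields shingles while the parsed query stays empty, which would match the behaviour reported here.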
