How to use a Lucene analyzer to tokenize a String?
Is there a simple way I could use any subclass of Lucene's `Analyzer` to parse/tokenize a `String`?

Something like:

```java
String to_be_parsed = "car window seven";
Analyzer analyzer = new StandardAnalyzer(...);
List<String> tokenized_string = analyzer.analyze(to_be_parsed);
```
4 Answers
Based on the answer above, this is slightly modified to work with Lucene 4.0:
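The answer's code block did not survive this copy; a sketch of what the Lucene 4.0 version most likely looked like is below. The class name `LuceneUtil` is mine, and the key 4.0 change is that `reset()` must be called before iterating the stream:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LuceneUtil {

    public static List<String> tokenizeString(Analyzer analyzer, String string) {
        List<String> result = new ArrayList<String>();
        try {
            TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
            stream.reset(); // required since Lucene 4.0, before the first incrementToken()
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(CharTermAttribute.class).toString());
            }
            stream.close();
        } catch (IOException e) {
            // not thrown because we're using a StringReader
            throw new RuntimeException(e);
        }
        return result;
    }
}
```

You would then call it with any analyzer, e.g. `LuceneUtil.tokenizeString(new StandardAnalyzer(Version.LUCENE_40), "car window seven")`.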
As far as I know, you have to write the loop yourself. Something like this (taken straight from my source tree):
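The code block is missing here as well; since this answer predates Lucene 4.0, the loop presumably used the old `TermAttribute` API, roughly like the following reconstruction (class and method names are mine):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class LuceneUtils {

    public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) {
        List<String> result = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
        try {
            // pre-4.0: no reset() call needed; read each token's text via TermAttribute
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(TermAttribute.class).term());
            }
        } catch (IOException e) {
            // not thrown because we're using a StringReader
        }
        return result;
    }
}
```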
The latest best practice, as another Stack Overflow answer indicates, seems to be to add an attribute to the token stream and later access that attribute, rather than getting an attribute directly from the token stream. And for good measure, you can make sure the analyzer gets closed. Using the very latest Lucene (currently v8.6.2), the code would look like this:
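(The original code block is missing from this copy; a sketch consistent with the answer's description, assuming a `StandardAnalyzer`, would be:)

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Tokenize {

    public static List<String> tokenizeString(String input) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer()) {   // make sure the analyzer gets closed
            TokenStream stream = analyzer.tokenStream(null, input);
            // add the attribute to the stream up front, then read it back for every token
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(termAtt.toString());
            }
            stream.end();
            stream.close();
        }
        return tokens;
    }
}
```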
After that code is finished, `tokens` will contain a list of parsed tokens. See also: Lucene Analysis Overview.
Caveat: I'm just starting to write Lucene code, so I don't have a lot of Lucene experience. I have taken the time to research the latest documentation and related posts, however, and I believe that the code I've placed here follows the latest recommended practices slightly better than the current answers.
Even better, use try-with-resources! That way you don't have to explicitly call the `.close()` that is required in higher versions of the library.
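Both code blocks from this answer are missing from this copy; a plausible reconstruction, assuming the same `StandardAnalyzer` setup as the previous answer, follows:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TryWithResourcesVersion {

    public static List<String> tokenizeString(String input) throws IOException {
        List<String> tokens = new ArrayList<>();
        // Analyzer and TokenStream both implement Closeable,
        // so try-with-resources closes them without an explicit .close()
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream(null, input)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(termAtt.toString());
            }
            stream.end();
        }
        return tokens;
    }
}
```

And the `Tokenizer` version (sketched here with a `StandardTokenizer`):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerVersion {

    public static List<String> tokenizeString(String input) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (StandardTokenizer tokenizer = new StandardTokenizer()) {
            tokenizer.setReader(new StringReader(input));
            CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                tokens.add(termAtt.toString());
            }
            tokenizer.end();
        }
        return tokens;
    }
}
```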