Lucene StandardAnalyzer 3.5 type attributes
I recently noticed that the behavior of the Lucene StandardAnalyzer has changed since version 3.1. Concretely, 3.0 and earlier versions recognized e-mail addresses, IP addresses, company names, etc. as separate lexical types, while later versions do not.
For example, for the input text "example@mail.com 127.0.0.1 H&M", the 3.0 analyzer would recognize the following types:
1: example@mail.com: 0->16: <EMAIL>
2: 127.0.0.1: 17->26: <HOST>
3: h&m: 27->30: <COMPANY>
However, version 3.1 and later give the following output for the same input text:
1: example: 0->7: <ALPHANUM>
2: mail.com: 8->16: <ALPHANUM>
3: 127.0.0.1: 17->26: <NUM>
My question is: how can I get the old StandardAnalyzer behavior with a newer version of the Lucene library? Are there standard TokenFilters that can help me achieve this, or do I need to implement custom filters?
Comments (1)
See the javadocs for StandardAnalyzer: As of 3.1, StandardTokenizer implements Unicode text segmentation.... ClassicTokenizer and ClassicAnalyzer are the pre-3.1 implementations of StandardTokenizer and StandardAnalyzer.
Alternatively, you can pass the LUCENE_30 version constant to StandardAnalyzer, which also gives you the previous behavior. That's the purpose of these version constants: behavior stays consistent for existing users, and you decide when to upgrade your app to the changed behavior.
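Both options can be tried side by side. The following is a minimal sketch, assuming Lucene 3.5 is on the classpath; the class name `TokenTypeDemo`, the helper `describeTokens`, and the field name `"body"` are made up for illustration, and the e-mail address in the sample text is reconstructed from the offsets given in the question. It prints each token with its offsets and type attribute so the two analyzers can be compared.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class TokenTypeDemo {

    // Hypothetical helper: runs the analyzer over the text and returns one
    // "term: start->end: type" line per token.
    static List<String> describeTokens(Analyzer analyzer, String text) throws IOException {
        List<String> lines = new ArrayList<String>();
        // "body" is an arbitrary field name chosen for this sketch.
        TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
        TypeAttribute type = stream.addAttribute(TypeAttribute.class);
        try {
            stream.reset();
            while (stream.incrementToken()) {
                lines.add(term.toString() + ": " + offset.startOffset() + "->"
                        + offset.endOffset() + ": " + type.type());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Sample text from the question (address reconstructed, see lead-in).
        String text = "example@mail.com 127.0.0.1 H&M";

        // Option 1: the pre-3.1 implementation under its new name.
        Analyzer classic = new ClassicAnalyzer(Version.LUCENE_35);
        // Option 2: StandardAnalyzer pinned to the 3.0 behavior.
        Analyzer pinned = new StandardAnalyzer(Version.LUCENE_30);

        System.out.println("ClassicAnalyzer(LUCENE_35):");
        for (String line : describeTokens(classic, text)) {
            System.out.println("  " + line);
        }
        System.out.println("StandardAnalyzer(LUCENE_30):");
        for (String line : describeTokens(pinned, text)) {
            System.out.println("  " + line);
        }
    }
}
```

If the answer holds, both analyzers should report the `<EMAIL>`, `<HOST>`, and `<COMPANY>` types listed in the question for 3.0. ClassicAnalyzer is probably the clearer choice for new code, since it states the intent directly rather than relying on an older version constant.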