Using pre-tokenized text with Lucene
My data is already tokenized with an external resource and I'd like to use that data within Lucene. My first idea would be to join those strings with a \x01 and use a WhitespaceTokenizer to split them again. Is there a better idea? (The input is in XML.)

As a bonus, this annotated data also contains synonyms; how would I inject them (represented as XML tags)?
2 Answers
Lucene allows you to provide your own stream of tokens to a field, bypassing the tokenization step. To do that, you can create your own subclass of TokenStream implementing incrementToken() and then call field.setTokenStream(new MyTokenStream(yourTokens)).
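A minimal sketch of such a stream, assuming the attribute-based TokenStream API of recent Lucene versions (the class name PreTokenizedStream is ours, not part of Lucene):

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch: emits an already-tokenized list of strings as a Lucene
// TokenStream, so no tokenizer runs over the text at all.
public final class PreTokenizedStream extends TokenStream {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final List<String> tokens;
    private Iterator<String> it;

    public PreTokenizedStream(List<String> tokens) {
        this.tokens = tokens;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        it = tokens.iterator();
    }

    @Override
    public boolean incrementToken() {
        if (it == null || !it.hasNext()) {
            return false; // stream exhausted
        }
        clearAttributes();
        termAtt.setEmpty().append(it.next()); // emit the next pre-made token
        return true;
    }
}
```

You would then attach it to a field, either via field.setTokenStream(...) as above or by passing it to a constructor such as new TextField("body", new PreTokenizedStream(yourTokens)).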
WhitespaceTokenizer is unfit for strings joined with 0x01. Instead, derive from CharTokenizer, overriding isTokenChar.

The main problem with this approach is that joining and then splitting again might be expensive; if it turns out to be too expensive, you can implement a trivial TokenStream that just emits the tokens from its input.

If by synonyms you mean that a term like "programmer" is expanded to a set of terms, say, {"programmer", "developer", "hacker"}, then I recommend emitting these at the same position. You can use a PositionIncrementAttribute to control this.

For an example of PositionIncrementAttribute usage, see my lemmatizing TokenStream, which emits both the word forms found in the full text and their lemmas at the same position.
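A sketch of both pieces under stated assumptions (the class names SeparatorTokenizer and SynonymInjectFilter are ours, and the CharTokenizer package has moved between Lucene releases): a CharTokenizer that splits only on the 0x01 joiner, and a TokenFilter that stacks synonyms on the original term's position by setting a position increment of 0:

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
// In older releases this class lives in org.apache.lucene.analysis.util.
import org.apache.lucene.analysis.CharTokenizer;

// Treats every character except the \u0001 joiner as part of a token,
// so the pre-made tokens come back out intact.
final class SeparatorTokenizer extends CharTokenizer {
    @Override
    protected boolean isTokenChar(int c) {
        return c != '\u0001';
    }
}

// Hypothetical filter: after each term, emits that term's synonyms (if
// any) with position increment 0, i.e. at the same position.
final class SynonymInjectFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
    private final Map<String, List<String>> synonyms; // term -> extra terms
    private final Deque<String> pending = new ArrayDeque<>();

    SynonymInjectFilter(TokenStream input, Map<String, List<String>> synonyms) {
        super(input);
        this.synonyms = synonyms;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a queued synonym stacked on the previous term's position.
            clearAttributes();
            termAtt.setEmpty().append(pending.pop());
            posAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        List<String> extra = synonyms.get(termAtt.toString());
        if (extra != null) {
            pending.addAll(extra);
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}
```

With the synonym map {"programmer" -> ["developer", "hacker"]}, the input "programmer\u0001writes\u0001code" would yield "programmer" (increment 1), then "developer" and "hacker" (increment 0 each), then "writes" and "code" as usual.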