How to extend WhitespaceTokenizer?
I need to use a tokenizer that splits words on whitespace, but that doesn't split if the whitespace is within double parentheses. Here is an example:
My input-> term1 term2 term3 ((term4 term5)) term6
should produce this list of tokens:
term1, term2, term3, ((term4 term5)), term6.
I think that I can obtain this behaviour by extending Lucene's WhitespaceTokenizer. How can I perform this extension?
Are there other solutions?
Thanks in advance.
2 Answers
I haven't tried to extend the Tokenizer, but I have here a nice (I think) solution with a regular expression, together with a method that splits a string by the matched groups of the regex and returns them as an array.
Note that the usual split() will return you nothing, or not what you want, because it only checks delimiters.
Hope this helps you.
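The code and output that originally accompanied this answer were not preserved. A minimal sketch of the described regex-and-matched-groups approach, using plain java.util.regex (the pattern and class name below are my own assumptions, not the answerer's code), might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParenAwareSplitter {

    // Either a whole ((...)) group or a run of non-whitespace characters.
    private static final Pattern TOKEN_PATTERN =
            Pattern.compile("\\(\\(.*?\\)\\)|\\S+");

    // Collects every match of the pattern and returns the tokens as an array.
    public static String[] split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN_PATTERN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        for (String token : split(input)) {
            System.out.println(token);
        }
    }
}
```

On the example input this prints term1, term2, term3, ((term4 term5)) and term6, one per line, because the ((...)) alternative is tried before the plain \S+ alternative and therefore wins at an opening "((".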
You can do this by extending WhitespaceTokenizer, but I expect it will be easier if you write a TokenFilter that reads from a WhitespaceTokenizer and pastes together consecutive tokens based on the number of parentheses. Overriding incrementToken is the main task when writing a Tokenizer-like class. I once did this myself; the result might serve as an example (though for technical reasons, I couldn't make my class a TokenFilter).
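No code was included with this answer; a minimal sketch of such a TokenFilter, assuming a recent Lucene version, might look like the following. The class name DoubleParenJoinFilter is hypothetical, the filter uses a simple heuristic (a token starting with "((" opens a group, a token ending with "))" closes it), and for brevity it does not fix up offset or position-increment attributes:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Joins the tokens between "((" and "))" emitted by a whitespace tokenizer
// back into a single token, keeping the parentheses.
public final class DoubleParenJoinFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DoubleParenJoinFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        // Pass through ordinary tokens and already-complete ((...)) tokens.
        if (!term.startsWith("((") || term.endsWith("))")) {
            return true;
        }
        // Otherwise accumulate following tokens until the group is closed.
        StringBuilder joined = new StringBuilder(term);
        while (!joined.toString().endsWith("))") && input.incrementToken()) {
            joined.append(' ').append(termAtt.toString());
        }
        termAtt.setEmpty().append(joined.toString());
        return true;
    }
}
```

Such a filter could then be chained after a WhitespaceTokenizer in a custom Analyzer's createComponents method, so the analysis chain as a whole produces the token list shown in the question.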